Principal Component Analysis is widely used in data exploration, dimension reduction, and data visualization. The aim is to transform the original variables into uncorrelated linear combinations while retaining as much of the information in the data as possible. High-dimensional data often reveals clusters when viewed in a lower-dimensional projection.
Clustering analysis is another form of EDA. Here we hope to group data points so that points within a group are close to each other and points in different groups are far apart. Clustering on the PCs can be effective. Clustering analysis can be quite subjective, particularly in how we summarize the properties within each group.
Both PCA and clustering analysis are so-called unsupervised learning: no response variable is involved in the process.
For supervised learning, we try to find out how a set of predictors relates to some response variable of interest. Multiple regression is still, by far, one of the most popular methods. We use linear models as working models for their simplicity and interpretability. It is important to use domain knowledge as much as we can to determine the form of the response as well as the functional form of the predictors.
Self-esteem generally describes a person's overall sense of self-worth and personal value. It can play a significant role in one's motivation and success throughout life. Factors that influence self-esteem include inner thinking, health condition, age, life experiences, etc. We will try to identify possible factors in our data that are related to the level of self-esteem.
The well-cited National Longitudinal Study of Youth (NLSY79) follows about 13,000 individuals, and numerous individual-year variables have been gathered through surveys. The survey data is open to the public. Among the many variables, we assembled a subset including personal demographic variables in different years, household environment in 1979, ASVAB test scores in 1981, and self-esteem scores in 1981 and 1987.
The data is stored in NLSY79.csv.
Here are the descriptions of the variables:
Personal Demographic Variables
Household Environment
Variables Related to ASVAB test Scores in 1981
| Test | Description |
|---|---|
| AFQT | percentile score on the AFQT intelligence test in 1981 |
| Coding | score on the Coding Speed test in 1981 |
| Auto | score on the Automotive and Shop test in 1981 |
| Mechanic | score on the Mechanic test in 1981 |
| Elec | score on the Electronics Information test in 1981 |
| Science | score on the General Science test in 1981 |
| Math | score on the Math test in 1981 |
| Arith | score on the Arithmetic Reasoning test in 1981 |
| Word | score on the Word Knowledge Test in 1981 |
| Parag | score on the Paragraph Comprehension test in 1981 |
| Numer | score on the Numerical Operations test in 1981 |
Self-Esteem Tests in 81 and 87
We have two sets of self-esteem tests, one in 1981 and the other in 1987. Each set has the same 10 questions, labeled Esteem81 and Esteem87 respectively, followed by the question number. For example, Esteem81_1 is esteem question 1 in 1981.
The following 10 questions are answered as 1: strongly agree, 2: agree, 3: disagree, 4: strongly disagree.
Load the data. Do a quick EDA to get familiar with the data set. Pay attention to the unit of each variable. Are there any missing values?
data.esteem=read.csv("data/NLSY79.csv")
dim(data.esteem)
## [1] 2431 46
names(data.esteem)
## [1] "Subject" "Gender" "Education05" "Income87"
## [5] "Job05" "Income05" "Weight05" "HeightFeet05"
## [9] "HeightInch05" "Imagazine" "Inewspaper" "Ilibrary"
## [13] "MotherEd" "FatherEd" "FamilyIncome78" "Science"
## [17] "Arith" "Word" "Parag" "Number"
## [21] "Coding" "Auto" "Math" "Mechanic"
## [25] "Elec" "AFQT" "Esteem81_1" "Esteem81_2"
## [29] "Esteem81_3" "Esteem81_4" "Esteem81_5" "Esteem81_6"
## [33] "Esteem81_7" "Esteem81_8" "Esteem81_9" "Esteem81_10"
## [37] "Esteem87_1" "Esteem87_2" "Esteem87_3" "Esteem87_4"
## [41] "Esteem87_5" "Esteem87_6" "Esteem87_7" "Esteem87_8"
## [45] "Esteem87_9" "Esteem87_10"
summary(data.esteem)
## Subject Gender Education05 Income87
## Min. : 2 female:1199 Min. : 6.0 Min. : -2
## 1st Qu.: 1592 male :1232 1st Qu.:12.0 1st Qu.: 4500
## Median : 3137 Median :13.0 Median :12000
## Mean : 3504 Mean :13.9 Mean :13399
## 3rd Qu.: 4668 3rd Qu.:16.0 3rd Qu.:19000
## Max. :12140 Max. :20.0 Max. :59387
##
## Job05
## 10 TO 430: Executive, Administrative and Managerial Occupations: 377
## 5000 TO 5930: Office and Administrative Support Workers : 360
## 4700 TO 4960: Sales and Related Workers : 205
## 6200 TO 6940: Construction Trade and Extraction Workers : 135
## 2200 TO 2340: Teachers : 120
## 9000 TO 9750: Transportation and Material Moving Workers : 117
## (Other) :1117
## Income05 Weight05 HeightFeet05 HeightInch05 Imagazine
## Min. : 63 Min. : 81 Min. :-4.00 Min. : 0.00 Min. :0.000
## 1st Qu.: 22650 1st Qu.:150 1st Qu.: 5.00 1st Qu.: 2.00 1st Qu.:0.000
## Median : 38500 Median :180 Median : 5.00 Median : 5.00 Median :1.000
## Mean : 49415 Mean :183 Mean : 5.18 Mean : 5.32 Mean :0.718
## 3rd Qu.: 61350 3rd Qu.:209 3rd Qu.: 5.00 3rd Qu.: 8.00 3rd Qu.:1.000
## Max. :703637 Max. :380 Max. : 8.00 Max. :11.00 Max. :1.000
##
## Inewspaper Ilibrary MotherEd FatherEd FamilyIncome78
## Min. :0.000 Min. :0.00 Min. : 0.0 Min. : 0.0 Min. : 0
## 1st Qu.:1.000 1st Qu.:1.00 1st Qu.:11.0 1st Qu.:10.0 1st Qu.:11167
## Median :1.000 Median :1.00 Median :12.0 Median :12.0 Median :20000
## Mean :0.861 Mean :0.77 Mean :11.7 Mean :11.8 Mean :21252
## 3rd Qu.:1.000 3rd Qu.:1.00 3rd Qu.:12.0 3rd Qu.:14.0 3rd Qu.:27500
## Max. :1.000 Max. :1.00 Max. :20.0 Max. :20.0 Max. :75001
##
## Science Arith Word Parag Number
## Min. : 0.0 Min. : 0.0 Min. : 0.0 Min. : 0.0 Min. : 0.0
## 1st Qu.:13.0 1st Qu.:13.0 1st Qu.:23.0 1st Qu.:10.0 1st Qu.:29.0
## Median :17.0 Median :19.0 Median :28.0 Median :12.0 Median :36.0
## Mean :16.3 Mean :18.6 Mean :26.6 Mean :11.2 Mean :35.5
## 3rd Qu.:20.0 3rd Qu.:25.0 3rd Qu.:32.0 3rd Qu.:14.0 3rd Qu.:44.0
## Max. :25.0 Max. :30.0 Max. :35.0 Max. :15.0 Max. :50.0
##
## Coding Auto Math Mechanic Elec
## Min. : 0.0 Min. : 0.0 Min. : 0.0 Min. : 0.0 Min. : 0.0
## 1st Qu.:38.0 1st Qu.:10.0 1st Qu.: 9.0 1st Qu.:11.0 1st Qu.: 9.0
## Median :48.0 Median :14.0 Median :14.0 Median :14.0 Median :12.0
## Mean :47.1 Mean :14.3 Mean :14.3 Mean :14.4 Mean :11.6
## 3rd Qu.:57.0 3rd Qu.:18.0 3rd Qu.:20.0 3rd Qu.:18.0 3rd Qu.:15.0
## Max. :84.0 Max. :25.0 Max. :25.0 Max. :25.0 Max. :20.0
##
## AFQT Esteem81_1 Esteem81_2 Esteem81_3 Esteem81_4
## Min. : 0.0 Min. :1.00 Min. :1.00 Min. :1.00 Min. :1.00
## 1st Qu.: 31.9 1st Qu.:1.00 1st Qu.:1.00 1st Qu.:3.00 1st Qu.:1.00
## Median : 57.0 Median :1.00 Median :1.00 Median :4.00 Median :2.00
## Mean : 54.7 Mean :1.42 Mean :1.42 Mean :3.51 Mean :1.57
## 3rd Qu.: 78.2 3rd Qu.:2.00 3rd Qu.:2.00 3rd Qu.:4.00 3rd Qu.:2.00
## Max. :100.0 Max. :4.00 Max. :4.00 Max. :4.00 Max. :4.00
##
## Esteem81_5 Esteem81_6 Esteem81_7 Esteem81_8 Esteem81_9
## Min. :1.00 Min. :1.00 Min. :1.00 Min. :1.00 Min. :1.00
## 1st Qu.:3.00 1st Qu.:1.00 1st Qu.:1.00 1st Qu.:3.00 1st Qu.:3.00
## Median :4.00 Median :2.00 Median :2.00 Median :3.00 Median :3.00
## Mean :3.46 Mean :1.62 Mean :1.75 Mean :3.13 Mean :3.16
## 3rd Qu.:4.00 3rd Qu.:2.00 3rd Qu.:2.00 3rd Qu.:4.00 3rd Qu.:4.00
## Max. :4.00 Max. :4.00 Max. :4.00 Max. :4.00 Max. :4.00
##
## Esteem81_10 Esteem87_1 Esteem87_2 Esteem87_3 Esteem87_4
## Min. :1.0 Min. :1.00 Min. :1.0 Min. :1.00 Min. :1.0
## 1st Qu.:3.0 1st Qu.:1.00 1st Qu.:1.0 1st Qu.:3.00 1st Qu.:1.0
## Median :3.0 Median :1.00 Median :1.0 Median :4.00 Median :1.0
## Mean :3.4 Mean :1.38 Mean :1.4 Mean :3.58 Mean :1.5
## 3rd Qu.:4.0 3rd Qu.:2.00 3rd Qu.:2.0 3rd Qu.:4.00 3rd Qu.:2.0
## Max. :4.0 Max. :4.00 Max. :4.0 Max. :4.00 Max. :4.0
##
## Esteem87_5 Esteem87_6 Esteem87_7 Esteem87_8 Esteem87_9
## Min. :1.00 Min. :1.00 Min. :1.00 Min. :1.0 Min. :1.00
## 1st Qu.:3.00 1st Qu.:1.00 1st Qu.:1.00 1st Qu.:3.0 1st Qu.:3.00
## Median :4.00 Median :2.00 Median :2.00 Median :3.0 Median :3.00
## Mean :3.53 Mean :1.59 Mean :1.72 Mean :3.1 Mean :3.06
## 3rd Qu.:4.00 3rd Qu.:2.00 3rd Qu.:2.00 3rd Qu.:4.0 3rd Qu.:4.00
## Max. :4.00 Max. :4.00 Max. :4.00 Max. :4.0 Max. :4.00
##
## Esteem87_10
## Min. :1.00
## 1st Qu.:3.00
## Median :3.00
## Mean :3.37
## 3rd Qu.:4.00
## Max. :4.00
##
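As a direct check on the missing-value question (the summary above would show per-column NA counts if any existed), we can count NAs explicitly:

```r
# Total number of missing values in the data set,
# plus a per-variable breakdown
sum(is.na(data.esteem))
colSums(is.na(data.esteem))
```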
Let's concentrate on the esteem scores evaluated in 1987.
Questions 1, 2, 4, 6 and 7 are positively worded, so we reverse their scores (new score = 5 - old score) to make a higher value always mean higher self-esteem:
data.esteem[, c(37, 38, 40, 42, 43)] <- 5 - data.esteem[, c(37, 38, 40, 42, 43)]
The average esteem scores seem to be roughly around 3.0 to 3.5.
barplot(colMeans(data.esteem[, 37:46]),
        main = "Esteem Measurements",
        xlab = "Average Scores")

The esteem scores are relatively consistent across the two years, with only Esteem_4 changing from a majority score of 3 to a majority score of 4.
library(tidyr)
library(ggplot2)
graphable_esteem <- data.esteem[,27:46]
ggplot(gather(graphable_esteem), aes(value)) +
geom_histogram(bins = 10) +
  facet_wrap(~key, scales = 'free_x')

3. Are the esteem scores all positively correlated? Report the pairwise correlation table and write a brief summary.
The correlation coefficient is a statistical measure of the strength of the relationship between the relative movements of two variables. There appears to be quite a strong positive correlation among all the esteem scores. For example, we see that question 1 (“I am a person of worth”) and question 2 (“I have a number of good qualities”) are highly correlated with a correlation coefficient of 0.7.
res2 <- cor(data.esteem[, 37:46])
res2
## Esteem87_1 Esteem87_2 Esteem87_3 Esteem87_4 Esteem87_5 Esteem87_6
## Esteem87_1 1.000 0.704 0.448 0.528 0.399 0.464
## Esteem87_2 0.704 1.000 0.443 0.551 0.402 0.481
## Esteem87_3 0.448 0.443 1.000 0.408 0.549 0.410
## Esteem87_4 0.528 0.551 0.408 1.000 0.381 0.509
## Esteem87_5 0.399 0.402 0.549 0.381 1.000 0.405
## Esteem87_6 0.464 0.481 0.410 0.509 0.405 1.000
## Esteem87_7 0.379 0.410 0.343 0.422 0.370 0.600
## Esteem87_8 0.273 0.283 0.351 0.295 0.381 0.409
## Esteem87_9 0.236 0.259 0.349 0.287 0.354 0.364
## Esteem87_10 0.312 0.330 0.460 0.366 0.436 0.442
## Esteem87_7 Esteem87_8 Esteem87_9 Esteem87_10
## Esteem87_1 0.379 0.273 0.236 0.312
## Esteem87_2 0.410 0.283 0.259 0.330
## Esteem87_3 0.343 0.351 0.349 0.460
## Esteem87_4 0.422 0.295 0.287 0.366
## Esteem87_5 0.370 0.381 0.354 0.436
## Esteem87_6 0.600 0.409 0.364 0.442
## Esteem87_7 1.000 0.389 0.352 0.390
## Esteem87_8 0.389 1.000 0.430 0.438
## Esteem87_9 0.352 0.430 1.000 0.579
## Esteem87_10 0.390 0.438 0.579 1.000
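One compact way to scan all 45 pairwise correlations at once is a heatmap of the matrix computed above (a base-R sketch; `scale = "none"` keeps the raw correlation values instead of row-standardizing):

```r
# Heatmap of the 10 x 10 esteem correlation matrix;
# symm = TRUE tells heatmap the matrix is symmetric
heatmap(res2, symm = TRUE, scale = "none",
        main = "Esteem87 pairwise correlations")
```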
PCA on the 10 esteem measurements (centered and scaled, i.e., PCA on the correlation matrix).
As shown in the output below, the loadings are indeed orthogonal unit vectors: the PCs are mutually orthogonal, and the sum of the squared PC1 loadings equals 1.
# Center and scale the 1987 esteem scores
library(dplyr)
center_scale <- function(x) {
  scale(x, center = TRUE, scale = TRUE)
}
data.esteem.centered <- center_scale(data.esteem[, 37:46])
data.esteem.centered <- as.data.frame(data.esteem.centered)
pc.2 <- prcomp(data.esteem.centered)
names(pc.2)
## [1] "sdev" "rotation" "center" "scale" "x"
pc.2$rotation[, 1:2]
## PC1 PC2
## Esteem87_1 0.324 -0.4452
## Esteem87_2 0.333 -0.4283
## Esteem87_3 0.322 0.0115
## Esteem87_4 0.324 -0.2877
## Esteem87_5 0.315 0.0793
## Esteem87_6 0.347 -0.0492
## Esteem87_7 0.315 0.0196
## Esteem87_8 0.280 0.3619
## Esteem87_9 0.277 0.4917
## Esteem87_10 0.318 0.3918
# The PC1 loadings form a unit vector: their squared sum is 1
sum(pc.2$rotation[, "PC1"]^2)
## [1] 1
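Beyond PC1, the whole loading matrix returned by `prcomp` is orthonormal; a quick check on the `pc.2` fit above:

```r
# t(R) %*% R should be the 10 x 10 identity matrix:
# unit-length columns (1s on the diagonal) that are
# mutually orthogonal (0s off the diagonal)
round(t(pc.2$rotation) %*% pc.2$rotation, 10)
```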
All PC1 loadings are positive. This makes sense: we flipped the scores of questions 1, 2, 4, 6 and 7, so every question now points in the same direction.
b) Are there good interpretations for PC1 and PC2? (If loadings are all negative, take the positive loadings for the ease of interpretation)
The PC1 loadings are entirely positive, so PC1 represents the total esteem of a subject, or their general view of themselves. PC2 has negative loadings on questions 1, 2, 4, and 6, four of the five questions whose scores we flipped earlier, so it contrasts those questions against the rest.
c) How is the PC1 score obtained for each subject? Write down the formula.
PC1 = 0.324·Esteem87_1 + 0.333·Esteem87_2 + 0.322·Esteem87_3 + 0.324·Esteem87_4 + 0.315·Esteem87_5 + 0.347·Esteem87_6 + 0.315·Esteem87_7 + 0.280·Esteem87_8 + 0.277·Esteem87_9 + 0.318·Esteem87_10, where each Esteem87_j has first been centered and scaled.
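As a sanity check, the stored PC1 scores can be reproduced by applying this linear combination row by row to the standardized data (using the `data.esteem.centered` and `pc.2` objects from above):

```r
# Multiply the standardized data by the PC1 loading vector
pc1.manual <- as.matrix(data.esteem.centered) %*% pc.2$rotation[, 1]
# Maximum discrepancy from prcomp's stored scores should be ~0
max(abs(pc1.manual - pc.2$x[, 1]))
```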
d) Are PC1 scores and PC2 scores in the data uncorrelated?
The PC scores are stored in pc.2$x. Correlating the first two score vectors gives zero up to numerical precision:
round(cor(pc.2$x[, 1], pc.2$x[, 2]), 10)
## [1] 0
Yes, PC1 and PC2 scores (and indeed every pair of PC scores) are uncorrelated by construction.
e) Plot PVE (Proportion of Variance Explained) and summarize the plot.
summary(pc.2)$importance
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8
## Standard deviation 2.166 1.102 0.911 0.8090 0.7699 0.7037 0.6714 0.6346
## Proportion of Variance 0.469 0.121 0.083 0.0654 0.0593 0.0495 0.0451 0.0403
## Cumulative Proportion 0.469 0.590 0.673 0.7388 0.7981 0.8477 0.8927 0.9330
## PC9 PC10
## Standard deviation 0.6133 0.5421
## Proportion of Variance 0.0376 0.0294
## Cumulative Proportion 0.9706 1.0000
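The two PVE rows in the importance table can be reproduced directly from the standard deviations, since each PC's variance is `sdev^2`:

```r
# Proportion of variance explained by each PC
pve <- pc.2$sdev^2 / sum(pc.2$sdev^2)
round(pve, 3)          # matches the "Proportion of Variance" row
round(cumsum(pve), 3)  # matches the "Cumulative Proportion" row
```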
# Scree plot of variances
plot(pc.2)

# Scree plot of PVEs
plot(summary(pc.2)$importance[2,],
ylab="PVE",
xlab="Number of PC's",
pch=16,
     main="Scree Plot of PVE for esteem")

The plot shows that PC1 alone explains nearly half of the total variance (about 47%), whereas PC3 and beyond each account for under 10% of the variability.
f) Also plot CPVE (Cumulative Proportion of Variance Explained). What proportion of the variance in the data is explained by the first two principal components?
# Scree plot of cumulative PVEs
plot(summary(pc.2)$importance[3,],
ylab="Cumulative PVE",
xlab="Number of PC's",
pch=16,
     main="Scree Plot of Cumulative PVE for esteem")

About 59% of the variance in the data is explained by the first two principal components.
g) PC’s provide us with a low dimensional view of the self-esteem scores. Use a biplot with the first two PC's to display the data. Give an interpretation of PC1 and PC2 from the plot. (try `ggbiplot` if you could, much prettier!)
# Biplot of the first two PCs
lim <- c(-.1, 0.1)
biplot(pc.2, xlim=lim, ylim=lim, main = "Biplot of the PC's")
abline(v=0, h=0)

PC1 is roughly the total of all esteem scores, while PC2 contrasts questions 8, 9, and 10 (positive loadings) against questions 1, 2, 4, and 6 (negative loadings).
Apply k-means to cluster subjects on the original esteem scores
According to the elbow rule, 3 clusters may be the optimal number of clusters.
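The elbow rule can also be sketched in base R without extra packages, by running `kmeans` over a range of k and plotting the total within-cluster sum of squares (the seed and `nstart` value here are our choices, not part of the original analysis):

```r
set.seed(1)  # kmeans uses random starts, so fix the seed for reproducibility
esteem87 <- data.esteem[, 37:46]
wss <- sapply(1:10, function(k)
  kmeans(esteem87, centers = k, nstart = 10)$tot.withinss)
plot(1:10, wss, type = "b", pch = 16,
     xlab = "Number of clusters k", ylab = "Total within-cluster SS",
     main = "Elbow plot for k-means on esteem scores")
```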
# library("car")
# library("factoextra")
# fviz_nbclust(data.esteem[, 37:46], kmeans, method = "wss")

b) Can you summarize common features within each cluster?
esteem.kmeans <- kmeans(data.esteem[, 37:46], centers = 3)
str(esteem.kmeans)
## List of 9
## $ cluster : int [1:2431] 2 1 2 3 1 1 1 2 1 1 ...
## $ centers : num [1:3, 1:10] 3.91 3.29 3.51 3.9 3.25 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:3] "1" "2" "3"
## .. ..$ : chr [1:10] "Esteem87_1" "Esteem87_2" "Esteem87_3" "Esteem87_4" ...
## $ totss : num 8775
## $ withinss : num [1:3] 2241 1523 1597
## $ tot.withinss: num 5361
## $ betweenss : num 3414
## $ size : int [1:3] 1125 800 506
## $ iter : int 3
## $ ifault : int 0
## - attr(*, "class")= chr "kmeans"
esteem.kmeans$cluster # one cluster label per subject, k = 3 centers
## [1] 2 1 2 3 1 1 1 2 1 1 1 2 1 1 1 3 1 3 2 2 2 2 3 1 1 1 1 3 1 3 1 3 1 3 1 1 2
## [38] 2 1 1 1 1 1 2 1 1 3 3 1 1 1 1 1 3 1 2 2 2 1 1 1 1 2 1 1 2 2 2 1 2 3 1 1 2
## ... (2431 labels in total; remaining output omitted)
esteem.kmeans$size # size of each cluster
## [1] 1125 800 506
Income.clustered <- data.esteem %>% select(Income05) %>%
  mutate(group = esteem.kmeans$cluster) %>%
  arrange(desc(group))
ggplot(Income.clustered, aes(x = Income05, fill = as.factor(group))) +
  geom_histogram(binwidth = 50000, position = "dodge") +
  theme_classic()

AFQT.clustered <- data.esteem %>% select(AFQT) %>%
  mutate(group = esteem.kmeans$cluster) %>%
  arrange(desc(group))
# AFQT is a percentile score (0-100), so use a small binwidth
ggplot(AFQT.clustered, aes(x = AFQT, fill = as.factor(group))) +
  geom_histogram(binwidth = 5, position = "dodge") +
  theme_classic()

c1 <- data.esteem %>%
  mutate(group = esteem.kmeans$cluster) %>%
  filter(group == 1)
ggplot(c1, aes(x = Income05)) +
  geom_histogram(binwidth = 50000)

c2 <- data.esteem %>%
  mutate(group = esteem.kmeans$cluster) %>%
  filter(group == 2)
ggplot(c2, aes(x = Income05)) +
  geom_histogram(binwidth = 50000)

c3 <- data.esteem %>%
  mutate(group = esteem.kmeans$cluster) %>%
  filter(group == 3)
ggplot(c3, aes(x = Income05)) +
  geom_histogram(binwidth = 50000)

The sizes of the three clusters turn out to be 1125, 800, and 506, respectively (kmeans uses random starts, so the labels and sizes can vary from run to run). The three clusters have similar spreads in terms of income, with cluster 3 slightly more left-skewed, followed by clusters 2 and 1. As for AFQT, group 1 is strongly left-skewed with most individuals scoring high, group 3 is moderately left-skewed, while group 2 is right-skewed and more uniform, with more people scoring low.
c) Can you visualize the clusters with somewhat clear boundaries? You may try different pairs of variables and different PC pairs of the esteem scores.
esteem.pca <- prcomp(data.esteem[, 37:46], center = TRUE, scale. = TRUE)
library("ggplot2")
esteem.final1 <- data.frame(pc1 = esteem.pca$x[,1], pc2 = esteem.pca$x[,2],
group = as.factor(esteem.kmeans$cluster))
ggplot(data = esteem.final1, aes(x = pc1, y = pc2, col = group)) +
  geom_point() + ggtitle("Clustering over PC1 and PC2")

esteem.final2 <- data.frame(Income = data.esteem["FamilyIncome78"],
                            AFQT = data.esteem["AFQT"],
                            group = as.factor(esteem.kmeans$cluster))
ggplot(data = esteem.final2, aes(x = FamilyIncome78, y = AFQT, col = group)) +
  geom_point() + ggtitle("Clustering over Income and AFQT Score")

esteem.final3 <- data.frame(Education = data.esteem["Education05"],
                            Weight = data.esteem["Weight05"],
                            group = as.factor(esteem.kmeans$cluster))
ggplot(data = esteem.final3, aes(x = Education05, y = Weight05, col = group)) +
  geom_point() + ggtitle("Clustering over Education and Weight")

As shown in the plots above, the total esteem score (PC1) is clearly differentiated across the clusters: cluster 1 has the highest esteem, followed by cluster 3, and then cluster 2. However, there is no clear separation among the three clusters in terms of variables such as weight, education, and income.
We now try to find out what factors are related to self-esteem. PC1 of all the esteem scores is a good variable to summarize one's esteem, so we take the PC1 score as our response variable.
Personal information: gender, education (from 2005, which is problematic since it post-dates the 1987 esteem measurement), log(income) in 87, job type in 87, and body mass index as a measure of health (BMI is defined as body mass divided by the square of body height, universally expressed in units of kg/m²). Since BMI here is measured in 2005, it is likewise not good practice to include it as a possible explanatory variable.
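Assuming `Weight05` is recorded in pounds and height is split into `HeightFeet05` and `HeightInch05` (as the variable names suggest), BMI in kg/m² can be computed with the standard imperial conversion factor of 703:

```r
# BMI = 703 * weight(lb) / height(in)^2
height.in <- data.esteem$HeightFeet05 * 12 + data.esteem$HeightInch05
data.esteem$BMI <- 703 * data.esteem$Weight05 / height.in^2
# Invalid heights (e.g., the -4 seen in the summary) produce nonsense
# BMI values and should be cleaned before modeling
summary(data.esteem$BMI)
```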
Household environment: Imagazine, Inewspaper, Ilibrary, MotherEd, FatherEd, FamilyIncome78. Set the indicators Imagazine, Inewspaper and Ilibrary as factors.
Use PC1 of the ASVAB scores as the level of intelligence.
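A sketch of building this intelligence summary: run PCA on the ten ASVAB component scores (using the column names from the `names()` output above; the composite AFQT percentile is left out here, a choice of ours) with centering and scaling, and keep the first score vector:

```r
# PCA on the ASVAB test scores; scale. = TRUE since the tests
# are on different scales
asvab <- data.esteem[, c("Science", "Arith", "Word", "Parag", "Number",
                         "Coding", "Auto", "Math", "Mechanic", "Elec")]
asvab.pca <- prcomp(asvab, center = TRUE, scale. = TRUE)
data.esteem$ASVAB_PC1 <- asvab.pca$x[, 1]  # intelligence summary variable
```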
data.esteem %>%
  select(Income87)
##   Income87
## 1    16000
## 2    18000
## 3        0
## 4     9000
## 5    15000
## 6     2200
## ... (one row per subject; output truncated. Note the negative placeholder values -1 and -2 and the top-coded maximum of 59387.)
## 769 0
## 770 9543
## 771 11559
## 772 9000
## 773 10000
## 774 7600
## 775 0
## 776 8000
## 777 17000
## 778 30149
## 779 11000
## 780 8000
## 781 8500
## 782 20000
## 783 10400
## 784 11000
## 785 16000
## 786 25000
## 787 200
## 788 10000
## 789 18000
## 790 4654
## 791 22000
## 792 17000
## 793 17300
## 794 10000
## 795 788
## 796 600
## 797 11000
## 798 33000
## 799 22000
## 800 4000
## 801 2400
## 802 24000
## 803 14000
## 804 15000
## 805 20101
## 806 20000
## 807 15000
## 808 11000
## 809 12000
## 810 6000
## 811 22500
## 812 24000
## 813 15000
## 814 13600
## 815 59387
## 816 7000
## 817 4900
## 818 19500
## 819 24500
## 820 900
## 821 6000
## 822 0
## 823 4000
## 824 14000
## 825 4000
## 826 19000
## 827 17000
## 828 14000
## 829 0
## 830 17000
## 831 12000
## 832 20000
## 833 900
## 834 0
## 835 15000
## 836 8000
## 837 5200
## 838 500
## 839 8500
## 840 0
## 841 8800
## 842 11000
## 843 5550
## 844 35000
## 845 35000
## 846 17000
## 847 400
## 848 5000
## 849 21500
## 850 0
## 851 7000
## 852 13500
## 853 0
## 854 20000
## 855 21000
## 856 25000
## 857 9500
## 858 -1
## 859 13000
## 860 5600
## 861 10600
## 862 17300
## 863 14000
## 864 3000
## 865 10000
## 866 4500
## 867 10000
## 868 200
## 869 28900
## 870 8000
## 871 59387
## 872 3700
## 873 22000
## 874 11000
## 875 12000
## 876 14000
## 877 1300
## 878 12500
## 879 20000
## 880 15000
## 881 20000
## 882 2000
## 883 11000
## 884 10000
## 885 10000
## 886 0
## 887 16200
## 888 0
## 889 5000
## 890 14000
## 891 13000
## 892 0
## 893 8000
## 894 20000
## 895 5000
## 896 6000
## 897 4000
## 898 1150
## 899 29900
## 900 15000
## 901 5000
## 902 25000
## 903 25000
## 904 25500
## 905 13000
## 906 0
## 907 17000
## 908 18000
## 909 10000
## 910 32000
## 911 14000
## 912 59387
## 913 26810
## 914 25500
## 915 18000
## 916 -2
## 917 0
## 918 59387
## 919 8000
## 920 15000
## 921 660
## 922 15000
## 923 1844
## 924 32000
## 925 13600
## 926 0
## 927 0
## 928 0
## 929 22000
## 930 1200
## 931 0
## 932 15000
## 933 6120
## 934 28000
## 935 38000
## 936 28700
## 937 13000
## 938 12000
## 939 0
## 940 100
## 941 0
## 942 18000
## 943 15000
## 944 26000
## 945 25000
## 946 15000
## 947 14000
## 948 0
## 949 12000
## 950 5450
## 951 4000
## 952 29000
## 953 11000
## 954 0
## 955 320
## 956 10000
## 957 0
## 958 13500
## 959 16000
## 960 25000
## 961 12000
## 962 13000
## 963 26000
## 964 14000
## 965 10000
## 966 1140
## 967 2000
## 968 5200
## 969 0
## 970 7000
## 971 23000
## 972 0
## 973 17000
## 974 12000
## 975 0
## 976 889
## 977 2800
## 978 5000
## 979 9000
## 980 11000
## 981 13000
## 982 13500
## 983 15000
## 984 25600
## 985 10000
## 986 15000
## 987 23660
## 988 14044
## 989 4526
## 990 0
## 991 5000
## 992 14000
## 993 0
## 994 0
## 995 7000
## 996 12317
## 997 20000
## 998 25000
## 999 7500
## 1000 10500
## 1001 4930
## 1002 22000
## 1003 11000
## 1004 0
## 1005 22000
## 1006 7600
## 1007 12000
## 1008 6800
## 1009 3000
## 1010 59387
## 1011 14500
## 1012 18000
## 1013 19000
## 1014 3500
## 1015 2000
## 1016 0
## 1017 0
## 1018 0
## 1019 25000
## 1020 0
## 1021 8000
## 1022 14698
## 1023 17000
## 1024 16000
## 1025 18000
## 1026 59387
## 1027 9000
## 1028 16000
## 1029 15000
## 1030 2000
## 1031 8000
## 1032 6800
## 1033 14000
## 1034 8000
## 1035 0
## 1036 0
## 1037 0
## 1038 6100
## 1039 9000
## 1040 8756
## 1041 1500
## 1042 5600
## 1043 9000
## 1044 3000
## 1045 28000
## 1046 0
## 1047 0
## 1048 11932
## 1049 1977
## 1050 14000
## 1051 36500
## 1052 11500
## 1053 13000
## 1054 14000
## 1055 15000
## 1056 12000
## 1057 12000
## 1058 0
## 1059 4000
## 1060 18500
## 1061 59387
## 1062 200
## 1063 600
## 1064 18000
## 1065 8300
## 1066 0
## 1067 9000
## 1068 15108
## 1069 13000
## 1070 9000
## 1071 11000
## 1072 23000
## 1073 13200
## 1074 8000
## 1075 15000
## 1076 6800
## 1077 15000
## 1078 16000
## 1079 9800
## 1080 0
## 1081 17000
## 1082 11000
## 1083 10255
## 1084 9000
## 1085 13000
## 1086 0
## 1087 10000
## 1088 24000
## 1089 10000
## 1090 24000
## 1091 12000
## 1092 10000
## 1093 160
## 1094 21000
## 1095 12000
## 1096 895
## 1097 10500
## 1098 11000
## 1099 21000
## 1100 18000
## 1101 24000
## 1102 11000
## 1103 9000
## 1104 9500
## 1105 4100
## 1106 11400
## 1107 22000
## 1108 17500
## 1109 17000
## 1110 59387
## 1111 18000
## 1112 2800
## 1113 21000
## 1114 23000
## 1115 26000
## 1116 32000
## 1117 25500
## 1118 28000
## 1119 15000
## 1120 7400
## 1121 35000
## 1122 13000
## 1123 32000
## 1124 21000
## 1125 14000
## 1126 59387
## 1127 30000
## 1128 3000
## 1129 59387
## 1130 0
## 1131 13000
## 1132 23000
## 1133 31000
## 1134 2000
## 1135 2000
## 1136 16000
## 1137 18500
## 1138 17000
## 1139 23000
## 1140 59387
## 1141 17000
## 1142 20600
## 1143 25500
## 1144 7000
## 1145 2880
## 1146 19000
## 1147 19000
## 1148 0
## 1149 11000
## 1150 15000
## 1151 28000
## 1152 7000
## 1153 1800
## 1154 25000
## 1155 33200
## 1156 0
## 1157 23000
## 1158 13000
## 1159 24000
## 1160 0
## 1161 0
## 1162 25500
## 1163 15100
## 1164 35000
## 1165 1600
## 1166 29000
## 1167 18000
## 1168 59387
## 1169 0
## 1170 19000
## 1171 22000
## 1172 17000
## 1173 25000
## 1174 25000
## 1175 26000
## 1176 15000
## 1177 25000
## 1178 1200
## 1179 11000
## 1180 39000
## 1181 20000
## 1182 22000
## 1183 33000
## 1184 18500
## 1185 0
## 1186 0
## 1187 5000
## 1188 22000
## 1189 13900
## 1190 8000
## 1191 11600
## 1192 0
## 1193 15000
## 1194 59387
## 1195 5200
## 1196 20000
## 1197 0
## 1198 11000
## 1199 11000
## 1200 -1
## 1201 18000
## 1202 11950
## 1203 13800
## 1204 2400
## 1205 200
## 1206 9600
## 1207 26000
## 1208 6876
## 1209 1100
## 1210 21000
## 1211 10500
## 1212 3500
## 1213 14000
## 1214 3528
## 1215 15000
## 1216 31000
## 1217 14000
## 1218 34500
## 1219 16000
## 1220 7500
## 1221 17000
## 1222 18500
## 1223 21000
## 1224 26400
## 1225 0
## 1226 35000
## 1227 25000
## 1228 13350
## 1229 17000
## 1230 12000
## 1231 0
## 1232 0
## 1233 18000
## 1234 500
## 1235 4500
## 1236 6000
## 1237 0
## 1238 18000
## 1239 21000
## 1240 32000
## 1241 0
## 1242 1000
## 1243 10000
## 1244 17000
## 1245 3600
## 1246 32000
## 1247 14200
## 1248 3000
## 1249 20000
## 1250 13500
## 1251 24900
## 1252 0
## 1253 59387
## 1254 10000
## 1255 18000
## 1256 7000
## 1257 13000
## 1258 5400
## 1259 7000
## 1260 11101
## 1261 8000
## 1262 16000
## 1263 15500
## 1264 25000
## 1265 19000
## 1266 21000
## 1267 10000
## 1268 2869
## 1269 6800
## 1270 12000
## 1271 13000
## 1272 1600
## 1273 27500
## 1274 14000
## 1275 14000
## 1276 12000
## 1277 0
## 1278 28600
## 1279 8000
## 1280 17000
## 1281 9000
## 1282 13000
## 1283 24700
## 1284 25000
## 1285 26000
## 1286 2000
## 1287 59387
## 1288 12000
## 1289 18000
## 1290 30000
## 1291 11000
## 1292 -2
## 1293 -1
## 1294 900
## 1295 23820
## 1296 0
## 1297 0
## 1298 21000
## 1299 3000
## 1300 23000
## 1301 16700
## 1302 16000
## 1303 12000
## 1304 23000
## 1305 11800
## 1306 16000
## 1307 14000
## 1308 7890
## 1309 16000
## 1310 3000
## 1311 25000
## 1312 16000
## 1313 8000
## 1314 200
## 1315 23000
## 1316 34000
## 1317 59387
## 1318 3200
## 1319 15008
## 1320 30000
## 1321 25500
## 1322 800
## 1323 59387
## 1324 17000
## 1325 8000
## 1326 300
## 1327 14500
## 1328 0
## 1329 21000
## 1330 0
## 1331 8000
## 1332 15000
## 1333 1500
## 1334 1200
## 1335 18000
## 1336 15000
## 1337 25000
## 1338 23000
## 1339 20000
## 1340 16000
## 1341 26000
## 1342 0
## 1343 32000
## 1344 12136
## 1345 -2
## 1346 -2
## 1347 -2
## 1348 15500
## 1349 6000
## 1350 12000
## 1351 31000
## 1352 15000
## 1353 9000
## 1354 5000
## 1355 5000
## 1356 28000
## 1357 22000
## 1358 7200
## 1359 16000
## 1360 16200
## 1361 27500
## 1362 12000
## 1363 4600
## 1364 26500
## 1365 32000
## 1366 8000
## 1367 14500
## 1368 22500
## 1369 59387
## 1370 15000
## 1371 16000
## 1372 0
## 1373 18780
## 1374 25000
## 1375 27000
## 1376 15000
## 1377 10400
## 1378 14000
## 1379 12000
## 1380 14000
## 1381 33500
## 1382 16000
## 1383 12500
## 1384 21500
## 1385 13500
## 1386 20000
## 1387 6000
## 1388 59387
## 1389 12000
## 1390 24600
## 1391 11000
## 1392 0
## 1393 0
## 1394 10600
## 1395 10940
## 1396 15900
## 1397 1400
## 1398 1500
## 1399 24090
## 1400 5283
## 1401 14000
## 1402 7500
## 1403 12000
## 1404 14500
## 1405 5000
## 1406 11000
## 1407 24600
## 1408 15400
## 1409 24000
## 1410 23000
## 1411 8000
## 1412 23000
## 1413 18000
## 1414 28000
## 1415 2000
## 1416 10000
## 1417 36000
## 1418 24000
## 1419 26000
## 1420 300
## 1421 31000
## 1422 22000
## 1423 1500
## 1424 26286
## 1425 29087
## 1426 9500
## 1427 2500
## 1428 35000
## 1429 0
## 1430 2000
## 1431 5200
## 1432 5000
## 1433 7220
## 1434 14000
## 1435 10100
## 1436 13000
## 1437 16000
## 1438 2200
## 1439 22000
## 1440 1200
## 1441 59387
## 1442 6000
## 1443 8000
## 1444 1000
## 1445 3000
## 1446 20000
## 1447 3200
## 1448 -2
## 1449 10000
## 1450 20000
## 1451 11000
## 1452 27000
## 1453 18500
## 1454 4424
## 1455 26000
## 1456 17400
## 1457 10000
## 1458 11000
## 1459 15600
## 1460 0
## 1461 0
## 1462 900
## 1463 18500
## 1464 24000
## 1465 15000
## 1466 6000
## 1467 8000
## 1468 11000
## 1469 5500
## 1470 3200
## 1471 30000
## 1472 10000
## 1473 22000
## 1474 17000
## 1475 20000
## 1476 30000
## 1477 25000
## 1478 2500
## 1479 16500
## 1480 1280
## 1481 6000
## 1482 20000
## 1483 12500
## 1484 4000
## 1485 11000
## 1486 11000
## 1487 9636
## 1488 10000
## 1489 11500
## 1490 1000
## 1491 14000
## 1492 14500
## 1493 15000
## 1494 9000
## 1495 8000
## 1496 4000
## 1497 12000
## 1498 10000
## 1499 14000
## 1500 13000
## 1501 19000
## 1502 10000
## 1503 10189
## 1504 15500
## 1505 15500
## 1506 16500
## 1507 7500
## 1508 10400
## 1509 24000
## 1510 59387
## 1511 4000
## 1512 4000
## 1513 3800
## 1514 0
## 1515 0
## 1516 1050
## 1517 15000
## 1518 8800
## 1519 17500
## 1520 12000
## 1521 25053
## 1522 1100
## 1523 16000
## 1524 36000
## 1525 14200
## 1526 7000
## 1527 30000
## 1528 8000
## 1529 2500
## 1530 14000
## 1531 0
## 1532 8000
## 1533 2360
## 1534 10500
## 1535 22000
## 1536 0
## 1537 24000
## 1538 23000
## 1539 835
## 1540 21000
## 1541 16000
## 1542 4000
## 1543 7000
## 1544 20500
## 1545 26000
## 1546 22000
## 1547 500
## 1548 59387
## 1549 15000
## 1550 2300
## 1551 14500
## 1552 12500
## 1553 20100
## 1554 7100
## 1555 17000
## 1556 16000
## 1557 0
## 1558 15300
## 1559 15000
## 1560 24000
## 1561 900
## 1562 0
## 1563 16000
## 1564 10500
## 1565 12000
## 1566 29010
## 1567 25000
## 1568 30000
## 1569 14000
## 1570 0
## 1571 35000
## 1572 0
## 1573 59387
## 1574 12000
## 1575 0
## 1576 5000
## 1577 20000
## 1578 15000
## 1579 13500
## 1580 8700
## 1581 10000
## 1582 325
## 1583 400
## 1584 23000
## 1585 22000
## 1586 6000
## 1587 0
## 1588 11000
## 1589 4000
## 1590 12000
## 1591 20000
## 1592 11000
## 1593 18000
## 1594 11000
## 1595 12000
## 1596 952
## 1597 9000
## 1598 25000
## 1599 12090
## 1600 17000
## 1601 12000
## 1602 19000
## 1603 1050
## 1604 21000
## 1605 6000
## 1606 0
## 1607 7000
## 1608 13000
## 1609 0
## 1610 24000
## 1611 300
## 1612 20000
## 1613 12480
## 1614 10000
## 1615 11000
## 1616 13000
## 1617 12500
## 1618 11000
## 1619 5000
## 1620 0
## 1621 59387
## 1622 600
## 1623 14700
## 1624 19000
## 1625 0
## 1626 13000
## 1627 16735
## 1628 14500
## 1629 12000
## 1630 0
## 1631 0
## 1632 8300
## 1633 10000
## 1634 20000
## 1635 18000
## 1636 20000
## 1637 10000
## 1638 12000
## 1639 500
## 1640 13000
## 1641 9360
## 1642 19200
## 1643 14000
## 1644 9900
## 1645 33000
## 1646 8000
## 1647 14000
## 1648 30000
## 1649 14000
## 1650 6000
## 1651 1000
## 1652 13704
## 1653 8280
## 1654 4500
## 1655 18700
## 1656 4500
## 1657 13000
## 1658 2000
## 1659 11000
## 1660 10000
## 1661 5000
## 1662 13312
## 1663 26400
## 1664 5000
## 1665 20000
## 1666 1400
## 1667 20000
## 1668 2000
## 1669 25000
## 1670 20000
## 1671 17000
## 1672 22000
## 1673 6200
## 1674 24750
## 1675 18000
## 1676 12000
## 1677 2500
## 1678 12000
## 1679 0
## 1680 22000
## 1681 -2
## 1682 14000
## 1683 0
## 1684 19000
## 1685 12000
## 1686 2000
## 1687 11300
## 1688 0
## 1689 4800
## 1690 1500
## 1691 27500
## 1692 15642
## 1693 4000
## 1694 5000
## 1695 17000
## 1696 21900
## 1697 10500
## 1698 10000
## 1699 6000
## 1700 0
## 1701 29000
## 1702 3000
## 1703 9000
## 1704 150
## 1705 0
## 1706 16000
## 1707 18000
## 1708 12800
## 1709 0
## 1710 17000
## 1711 9000
## 1712 19000
## 1713 59387
## 1714 30000
## 1715 0
## 1716 5000
## 1717 15500
## 1718 24000
## 1719 31000
## 1720 11000
## 1721 900
## 1722 0
## 1723 5780
## 1724 14000
## 1725 626
## 1726 33000
## 1727 24500
## 1728 27000
## 1729 3500
## 1730 2000
## 1731 0
## 1732 19584
## 1733 25000
## 1734 22000
## 1735 8000
## 1736 4600
## 1737 7363
## 1738 16000
## 1739 13000
## 1740 15000
## 1741 10500
## 1742 13000
## 1743 27000
## 1744 6470
## 1745 33000
## 1746 0
## 1747 6500
## 1748 2400
## 1749 6000
## 1750 7000
## 1751 35000
## 1752 500
## 1753 22500
## 1754 0
## 1755 14000
## 1756 13000
## 1757 15600
## 1758 0
## 1759 24000
## 1760 13600
## 1761 39000
## 1762 0
## 1763 35000
## 1764 15000
## 1765 32460
## 1766 18000
## 1767 20000
## 1768 3000
## 1769 32000
## 1770 31000
## 1771 10000
## 1772 0
## 1773 16000
## 1774 15000
## 1775 0
## 1776 3000
## 1777 0
## 1778 -2
## 1779 2500
## 1780 0
## 1781 3500
## 1782 2200
## 1783 17000
## 1784 11600
## 1785 26000
## 1786 24000
## 1787 2300
## 1788 5147
## 1789 6000
## 1790 20000
## 1791 0
## 1792 11200
## 1793 17000
## 1794 1900
## 1795 17000
## 1796 59387
## 1797 19000
## 1798 32000
## 1799 21000
## 1800 18000
## 1801 7000
## 1802 17000
## 1803 15000
## 1804 3500
## 1805 16000
## 1806 6000
## 1807 3000
## 1808 4000
## 1809 59387
## 1810 26000
## 1811 25600
## 1812 0
## 1813 25000
## 1814 0
## 1815 0
## 1816 0
## 1817 0
## 1818 0
## 1819 18000
## 1820 9000
## 1821 14000
## 1822 23000
## 1823 30000
## 1824 27000
## 1825 18600
## 1826 18000
## 1827 7800
## 1828 8000
## 1829 14000
## 1830 25000
## 1831 0
## 1832 2500
## 1833 59387
## 1834 35000
## 1835 3000
## 1836 32000
## 1837 38500
## 1838 32000
## 1839 10000
## 1840 25000
## 1841 7000
## 1842 15000
## 1843 0
## 1844 15500
## 1845 3500
## 1846 30000
## 1847 18000
## 1848 2500
## 1849 0
## 1850 6500
## 1851 0
## 1852 13000
## 1853 35000
## 1854 0
## 1855 120
## 1856 0
## 1857 11800
## 1858 9000
## 1859 23000
## 1860 9800
## 1861 20000
## 1862 21000
## 1863 5000
## 1864 6000
## 1865 24000
## 1866 0
## 1867 22000
## 1868 10000
## 1869 10000
## 1870 0
## 1871 12500
## 1872 5000
## 1873 1400
## 1874 10000
## 1875 1300
## 1876 11000
## 1877 20000
## 1878 7000
## 1879 22700
## 1880 2000
## 1881 6500
## 1882 1500
## 1883 39000
## 1884 10000
## 1885 8000
## 1886 0
## 1887 30000
## 1888 14000
## 1889 12000
## 1890 -1
## 1891 10000
## 1892 2000
## 1893 15000
## 1894 18000
## 1895 0
## 1896 15000
## 1897 0
## 1898 0
## 1899 0
## 1900 12200
## 1901 0
## 1902 0
## 1903 2500
## 1904 0
## 1905 0
## 1906 8000
## 1907 4200
## 1908 15200
## 1909 59387
## 1910 12000
## 1911 0
## 1912 2000
## 1913 9000
## 1914 6000
## 1915 18000
## 1916 600
## 1917 23000
## 1918 1169
## 1919 27000
## 1920 13600
## 1921 21000
## 1922 7500
## 1923 18000
## 1924 22000
## 1925 15000
## 1926 11000
## 1927 59387
## 1928 17000
## 1929 29700
## 1930 3500
## 1931 13000
## 1932 19000
## 1933 0
## 1934 0
## 1935 8000
## 1936 3000
## 1937 13000
## 1938 30000
## 1939 1300
## 1940 1300
## 1941 15600
## 1942 15500
## 1943 10000
## 1944 8000
## 1945 9771
## 1946 8160
## 1947 0
## 1948 7000
## 1949 1500
## 1950 6000
## 1951 0
## 1952 11500
## 1953 27000
## 1954 10000
## 1955 5000
## 1956 13000
## 1957 50
## 1958 22500
## 1959 12000
## 1960 18000
## 1961 1700
## 1962 11200
## 1963 20000
## 1964 28000
## 1965 0
## 1966 24000
## 1967 8000
## 1968 10000
## 1969 13000
## 1970 13000
## 1971 59387
## 1972 20100
## 1973 20000
## 1974 88
## 1975 0
## 1976 16500
## 1977 20000
## 1978 26000
## 1979 0
## 1980 59387
## 1981 7200
## 1982 0
## 1983 59387
## 1984 16000
## 1985 19200
## 1986 12500
## 1987 31000
## 1988 35000
## 1989 11000
## 1990 5495
## 1991 11000
## 1992 6000
## 1993 8400
## 1994 15000
## 1995 14500
## 1996 10900
## 1997 15200
## 1998 -2
## 1999 4000
## 2000 3000
## 2001 0
## 2002 4000
## 2003 23000
## 2004 2400
## 2005 12000
## 2006 26000
## 2007 18000
## 2008 0
## 2009 13400
## 2010 3020
## 2011 0
## 2012 1000
## 2013 -2
## 2014 4000
## 2015 22000
## 2016 20000
## 2017 0
## 2018 30000
## 2019 7000
## 2020 0
## 2021 19000
## 2022 0
## 2023 23000
## 2024 36000
## 2025 7100
## 2026 7000
## 2027 3500
## 2028 15000
## 2029 9000
## 2030 9000
## 2031 1200
## 2032 0
## 2033 18000
## 2034 5300
## 2035 200
## 2036 7200
## 2037 19250
## 2038 0
## 2039 12370
## 2040 3500
## 2041 18000
## 2042 16500
## 2043 0
## 2044 0
## 2045 24000
## 2046 2000
## 2047 5900
## 2048 0
## 2049 19000
## 2050 26900
## 2051 27000
## 2052 26000
## 2053 15443
## 2054 10000
## 2055 11000
## 2056 4700
## 2057 3500
## 2058 13000
## 2059 12000
## 2060 2500
## 2061 11000
## 2062 20000
## 2063 4000
## 2064 1000
## 2065 10000
## 2066 18000
## 2067 26000
## 2068 10000
## 2069 10000
## 2070 9000
## 2071 10800
## 2072 11000
## 2073 20000
## 2074 30000
## 2075 20000
## 2076 17000
## 2077 19200
## 2078 18000
## 2079 0
## 2080 30000
## 2081 3000
## 2082 0
## 2083 13000
## 2084 11000
## 2085 17500
## 2086 17000
## 2087 19000
## 2088 2000
## 2089 13000
## 2090 16000
## 2091 400
## 2092 27600
## 2093 0
## 2094 0
## 2095 18500
## 2096 5000
## 2097 8000
## 2098 0
## 2099 25000
## 2100 500
## 2101 21000
## 2102 8000
## 2103 0
## 2104 24000
## 2105 4000
## 2106 9000
## 2107 22000
## 2108 24000
## 2109 0
## 2110 12000
## 2111 18000
## 2112 26000
## 2113 8000
## 2114 4800
## 2115 3500
## 2116 25000
## 2117 0
## 2118 3020
## 2119 20000
## 2120 14000
## 2121 13000
## 2122 0
## 2123 21000
## 2124 300
## 2125 0
## 2126 15000
## 2127 6000
## 2128 11000
## 2129 15700
## 2130 19000
## 2131 20000
## 2132 0
## 2133 12593
## 2134 6500
## 2135 5000
## 2136 3000
## 2137 21000
## 2138 15000
## 2139 59387
## 2140 20000
## 2141 19900
## 2142 12000
## 2143 0
## 2144 5500
## 2145 30000
## 2146 18000
## 2147 28000
## 2148 24500
## 2149 23000
## 2150 700
## 2151 16000
## 2152 8000
## 2153 14000
## 2154 9300
## 2155 20400
## 2156 1200
## 2157 0
## 2158 8500
## 2159 5000
## 2160 2000
## 2161 9000
## 2162 7919
## 2163 16000
## 2164 9000
## 2165 14000
## 2166 3041
## 2167 0
## 2168 10000
## 2169 24000
## 2170 900
## 2171 0
## 2172 6000
## 2173 0
## 2174 6667
## 2175 0
## 2176 20000
## 2177 4434
## 2178 9900
## 2179 520
## 2180 -2
## 2181 8000
## 2182 17000
## 2183 13000
## 2184 14000
## 2185 1200
## 2186 10500
## 2187 0
## 2188 25170
## 2189 0
## 2190 10000
## 2191 -2
## 2192 8000
## 2193 12000
## 2194 15000
## 2195 10000
## 2196 59387
## 2197 0
## 2198 8000
## 2199 0
## 2200 12000
## 2201 14200
## 2202 28000
## 2203 18000
## 2204 17000
## 2205 0
## 2206 8000
## 2207 14000
## 2208 11000
## 2209 8000
## 2210 10000
## 2211 10800
## 2212 17500
## 2213 21000
## 2214 8100
## 2215 17900
## 2216 14000
## 2217 11000
## 2218 7000
## 2219 16800
## 2220 14500
## 2221 1300
## 2222 20000
## 2223 18000
## 2224 14000
## 2225 8000
## 2226 12000
## 2227 9000
## 2228 7000
## 2229 7500
## 2230 7000
## 2231 2500
## 2232 10000
## 2233 20000
## 2234 15500
## 2235 2200
## 2236 38000
## 2237 12000
## 2238 16500
## 2239 10000
## 2240 30000
## 2241 0
## 2242 5300
## 2243 30000
## 2244 0
## 2245 19000
## 2246 13600
## 2247 24000
## 2248 0
## 2249 25000
## 2250 23000
## 2251 8000
## 2252 18000
## 2253 9600
## 2254 0
## 2255 19000
## 2256 25000
## 2257 59387
## 2258 18000
## 2259 9400
## 2260 19000
## 2261 13000
## 2262 17100
## 2263 6000
## 2264 13500
## 2265 8400
## 2266 0
## 2267 22000
## 2268 19000
## 2269 0
## 2270 7000
## 2271 5800
## 2272 18000
## 2273 0
## 2274 13000
## 2275 0
## 2276 7200
## 2277 38300
## 2278 15000
## 2279 37000
## 2280 18000
## 2281 22000
## 2282 0
## 2283 1340
## 2284 5400
## 2285 29000
## 2286 10000
## 2287 9000
## 2288 0
## 2289 0
## 2290 18000
## 2291 0
## 2292 10000
## 2293 17000
## 2294 18000
## 2295 9500
## 2296 0
## 2297 0
## 2298 6538
## 2299 14400
## 2300 0
## 2301 9200
## 2302 8000
## 2303 19000
## 2304 0
## 2305 5000
## 2306 15000
## 2307 11000
## 2308 2000
## 2309 22800
## 2310 4000
## 2311 11000
## 2312 0
## 2313 16000
## 2314 12000
## 2315 7000
## 2316 18000
## 2317 12000
## 2318 15200
## 2319 22500
## 2320 0
## 2321 12000
## 2322 7000
## 2323 -2
## 2324 8000
## 2325 27800
## 2326 10500
## 2327 0
## 2328 13872
## 2329 3500
## 2330 59387
## 2331 15000
## 2332 14000
## 2333 16000
## 2334 14000
## 2335 4000
## 2336 5700
## 2337 3000
## 2338 17000
## 2339 0
## 2340 4000
## 2341 4000
## 2342 31000
## 2343 12000
## 2344 0
## 2345 3000
## 2346 0
## 2347 0
## 2348 10000
## 2349 30000
## 2350 13000
## 2351 24000
## 2352 7000
## 2353 29565
## 2354 25800
## 2355 25000
## 2356 10000
## 2357 6000
## 2358 8000
## 2359 1000
## 2360 32000
## 2361 23000
## 2362 2000
## 2363 0
## 2364 4320
## 2365 7000
## 2366 13000
## 2367 0
## 2368 6000
## 2369 14000
## 2370 20000
## 2371 30000
## 2372 3000
## 2373 23000
## 2374 14000
## 2375 16400
## 2376 5000
## 2377 23000
## 2378 175
## 2379 37251
## 2380 7800
## 2381 36700
## 2382 19000
## 2383 31000
## 2384 21000
## 2385 17000
## 2386 14000
## 2387 0
## 2388 6000
## 2389 15000
## 2390 0
## 2391 15000
## 2392 15000
## 2393 0
## 2394 3500
## 2395 11000
## 2396 23000
## 2397 26959
## 2398 957
## 2399 9000
## 2400 0
## 2401 14871
## 2402 5000
## 2403 23000
## 2404 12000
## 2405 12500
## 2406 0
## 2407 13500
## 2408 14800
## 2409 21000
## 2410 6000
## 2411 14000
## 2412 14000
## 2413 16000
## 2414 2000
## 2415 24000
## 2416 20000
## 2417 6000
## 2418 5000
## 2419 12000
## 2420 20000
## 2421 21000
## 2422 9000
## 2423 10000
## 2424 3000
## 2425 10000
## 2426 17000
## 2427 8000
## 2428 1000
## 2429 9000
## 2430 14000
## 2431 5000
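Before transforming income, the sentinel codes visible in the printout above arguably deserve explicit handling. A minimal sketch, assuming the negative values are NLSY non-response codes and that 59387 is a top-code in this extract (both assumptions based only on the printed values):

```r
# Recode apparent non-response values (-1, -2) to NA so they are dropped by
# na.omit() later, rather than entering log() as negative incomes.
# The 59387 top-code is left in place here, but could be flagged instead.
data.esteem <- data.esteem %>%
  mutate(Income87 = ifelse(Income87 < 0, NA, Income87))
```

Whether to also cap or exclude the top-coded values is a judgment call; either choice should be stated when interpreting the income coefficient.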
data.esteem.pca <- data.esteem %>%
  mutate(log_income = log(Income87 + 2),
         bmi = Weight05 / (HeightFeet05^2)) %>%   # weight over squared height in feet
  select(Gender, Education05, Job05, MotherEd, FatherEd, FamilyIncome78,
         bmi, log_income,
         Esteem87_1, Esteem87_2, Esteem87_3, Esteem87_4, Esteem87_5,
         Esteem87_6, Esteem87_7, Esteem87_8, Esteem87_9, Esteem87_10)
data.esteem.pca <- data.esteem.pca %>%
  mutate(pc1_esteem = 0.324*Esteem87_1 + 0.333*Esteem87_2 + 0.322*Esteem87_3 +
           0.324*Esteem87_4 + 0.315*Esteem87_5 + 0.347*Esteem87_6 +
           0.315*Esteem87_7 + 0.280*Esteem87_8 + 0.277*Esteem87_9 +
           0.318*Esteem87_10)
b) Run a few regression models between PC1 of all the esteem scores and the suitable variables listed in a). Find a final best model using your own criterion.
- How did you land on this model? Run a model diagnosis to see whether the linear model assumptions are reasonably met.
- Write a summary of your findings. In particular, explain which variables are in the model and how they affect one's self-esteem.
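Rather than hard-coding the PC1 loadings, they can be recovered directly from `prcomp()` on the ten 1987 esteem items. A sketch, assuming the `Esteem87_*` columns are numeric (note that `scale. = TRUE` would instead give loadings on standardized items, so the weights would differ slightly):

```r
# First principal component of the ten esteem items; the rotation column
# should match the hard-coded weights above, up to an overall sign flip.
esteem.items <- data.esteem.pca %>%
  select(Esteem87_1:Esteem87_10) %>%
  na.omit()
pca.esteem <- prcomp(esteem.items)
pca.esteem$rotation[, 1]               # PC1 loadings
summary(pca.esteem)$importance[2, 1]   # proportion of variance explained by PC1
```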
rgr.data <- na.omit(data.esteem.pca)
rgr.data <- rgr.data %>%
  filter(!is.infinite(log_income))
model1 <- lm(pc1_esteem ~ Gender + Education05 + Job05 + MotherEd + FatherEd +
               FamilyIncome78 + bmi + log_income, data = rgr.data)
summary(model1)
##
## Call:
## lm(formula = pc1_esteem ~ Gender + Education05 + Job05 + MotherEd +
## FatherEd + FamilyIncome78 + bmi + log_income, data = rgr.data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.176 -0.944 -0.005 0.993 2.865
##
## Coefficients:
## Estimate
## (Intercept) 8.74e+00
## Gendermale 1.65e-01
## Education05 8.03e-02
## Job0510 TO 430: Executive, Administrative and Managerial Occupations 4.53e-01
## Job051000 TO 1240: Mathematical and Computer Scientists 5.05e-01
## Job051300 TO 1560: Engineers, Architects, Surveyers, Engineering and Related Technicians 2.32e-02
## Job051600 TO 1760: Physical Scientists -7.90e-01
## Job051800 TO 1860: Social Scientists and Related Workers -3.07e-01
## Job051900 TO 1960: Life, Physical and Social Science Technicians 2.75e-01
## Job052000 TO 2060: Counselors, Sociala and Religious Workers 1.56e-01
## Job052100 TO 2150: Lawyers, Judges and Legal Support Workers 2.49e-01
## Job052200 TO 2340: Teachers 1.92e-01
## Job052400 TO 2550: Education, Training and Library Workers 1.96e-01
## Job052600 TO 2760: Entertainers and Performers, Sports and Related Workers 7.62e-01
## Job052800 TO 2960: Media and Communications Workers 3.67e-01
## Job053000 TO 3260: Health Diagnosing and Treating Practitioners 4.63e-01
## Job053300 TO 3650: Health Care Technical and Support Occupations -1.13e-01
## Job053700 TO 3950: Protective Service Occupations 5.67e-01
## Job054000 TO 4160: Food Preparation and Serving Related Occupations -1.43e-01
## Job054200 TO 4250: Cleaning and Building Service Occupations -2.59e-01
## Job054300 TO 4430: Entertainment Attendants and Related Workers -6.97e-01
## Job054500 TO 4650: Personal Care and Service Workers 2.71e-01
## Job054700 TO 4960: Sales and Related Workers 2.84e-01
## Job05500 TO 950: Management Related Occupations 5.27e-01
## Job055000 TO 5930: Office and Administrative Support Workers 2.97e-01
## Job056000 TO 6130: Farming, Fishing and Forestry Occupations -1.75e-01
## Job056200 TO 6940: Construction Trade and Extraction Workers 2.88e-02
## Job057000 TO 7620: Installation, Maintenance and Repairs Workers 1.42e-01
## Job057700 TO 7750: Production and Operating Workers 1.87e-01
## Job057800 TO 7850: Food Preparation Occupations 1.89e-01
## Job057900 TO 8960: Setters, Operators and Tenders 1.39e-01
## Job059000 TO 9750: Transportation and Material Moving Workers -1.13e-01
## Job059990: Uncodeable -5.12e-02
## MotherEd 2.81e-02
## FatherEd 1.03e-02
## FamilyIncome78 2.94e-06
## bmi -1.42e-02
## log_income 2.37e-02
## Std. Error
## (Intercept) 2.66e-01
## Gendermale 5.91e-02
## Education05 1.28e-02
## Job0510 TO 430: Executive, Administrative and Managerial Occupations 1.73e-01
## Job051000 TO 1240: Mathematical and Computer Scientists 2.22e-01
## Job051300 TO 1560: Engineers, Architects, Surveyers, Engineering and Related Technicians 2.33e-01
## Job051600 TO 1760: Physical Scientists 6.24e-01
## Job051800 TO 1860: Social Scientists and Related Workers 5.18e-01
## Job051900 TO 1960: Life, Physical and Social Science Technicians 4.82e-01
## Job052000 TO 2060: Counselors, Sociala and Religious Workers 2.51e-01
## Job052100 TO 2150: Lawyers, Judges and Legal Support Workers 3.52e-01
## Job052200 TO 2340: Teachers 2.00e-01
## Job052400 TO 2550: Education, Training and Library Workers 2.76e-01
## Job052600 TO 2760: Entertainers and Performers, Sports and Related Workers 2.95e-01
## Job052800 TO 2960: Media and Communications Workers 3.71e-01
## Job053000 TO 3260: Health Diagnosing and Treating Practitioners 2.16e-01
## Job053300 TO 3650: Health Care Technical and Support Occupations 2.02e-01
## Job053700 TO 3950: Protective Service Occupations 2.30e-01
## Job054000 TO 4160: Food Preparation and Serving Related Occupations 2.18e-01
## Job054200 TO 4250: Cleaning and Building Service Occupations 2.20e-01
## Job054300 TO 4430: Entertainment Attendants and Related Workers 4.14e-01
## Job054500 TO 4650: Personal Care and Service Workers 2.46e-01
## Job054700 TO 4960: Sales and Related Workers 1.82e-01
## Job05500 TO 950: Management Related Occupations 1.99e-01
## Job055000 TO 5930: Office and Administrative Support Workers 1.73e-01
## Job056000 TO 6130: Farming, Fishing and Forestry Occupations 4.34e-01
## Job056200 TO 6940: Construction Trade and Extraction Workers 1.95e-01
## Job057000 TO 7620: Installation, Maintenance and Repairs Workers 2.01e-01
## Job057700 TO 7750: Production and Operating Workers 2.38e-01
## Job057800 TO 7850: Food Preparation Occupations 6.23e-01
## Job057900 TO 8960: Setters, Operators and Tenders 1.99e-01
## Job059000 TO 9750: Transportation and Material Moving Workers 1.98e-01
## Job059990: Uncodeable 1.21e+00
## MotherEd 1.25e-02
## FatherEd 9.22e-03
## FamilyIncome78 1.94e-06
## bmi 1.40e-02
## log_income 8.59e-03
## t value
## (Intercept) 32.87
## Gendermale 2.79
## Education05 6.25
## Job0510 TO 430: Executive, Administrative and Managerial Occupations 2.62
## Job051000 TO 1240: Mathematical and Computer Scientists 2.27
## Job051300 TO 1560: Engineers, Architects, Surveyers, Engineering and Related Technicians 0.10
## Job051600 TO 1760: Physical Scientists -1.27
## Job051800 TO 1860: Social Scientists and Related Workers -0.59
## Job051900 TO 1960: Life, Physical and Social Science Technicians 0.57
## Job052000 TO 2060: Counselors, Sociala and Religious Workers 0.62
## Job052100 TO 2150: Lawyers, Judges and Legal Support Workers 0.71
## Job052200 TO 2340: Teachers 0.96
## Job052400 TO 2550: Education, Training and Library Workers 0.71
## Job052600 TO 2760: Entertainers and Performers, Sports and Related Workers 2.59
## Job052800 TO 2960: Media and Communications Workers 0.99
## Job053000 TO 3260: Health Diagnosing and Treating Practitioners 2.14
## Job053300 TO 3650: Health Care Technical and Support Occupations -0.56
## Job053700 TO 3950: Protective Service Occupations 2.46
## Job054000 TO 4160: Food Preparation and Serving Related Occupations -0.65
## Job054200 TO 4250: Cleaning and Building Service Occupations -1.18
## Job054300 TO 4430: Entertainment Attendants and Related Workers -1.68
## Job054500 TO 4650: Personal Care and Service Workers 1.10
## Job054700 TO 4960: Sales and Related Workers 1.56
## Job05500 TO 950: Management Related Occupations 2.64
## Job055000 TO 5930: Office and Administrative Support Workers 1.71
## Job056000 TO 6130: Farming, Fishing and Forestry Occupations -0.40
## Job056200 TO 6940: Construction Trade and Extraction Workers 0.15
## Job057000 TO 7620: Installation, Maintenance and Repairs Workers 0.70
## Job057700 TO 7750: Production and Operating Workers 0.79
## Job057800 TO 7850: Food Preparation Occupations 0.30
## Job057900 TO 8960: Setters, Operators and Tenders 0.70
## Job059000 TO 9750: Transportation and Material Moving Workers -0.57
## Job059990: Uncodeable -0.04
## MotherEd 2.25
## FatherEd 1.11
## FamilyIncome78 1.52
## bmi -1.01
## log_income 2.76
## Pr(>|t|)
## (Intercept) < 2e-16
## Gendermale 0.0053
## Education05 4.9e-10
## Job0510 TO 430: Executive, Administrative and Managerial Occupations 0.0090
## Job051000 TO 1240: Mathematical and Computer Scientists 0.0232
## Job051300 TO 1560: Engineers, Architects, Surveyers, Engineering and Related Technicians 0.9209
## Job051600 TO 1760: Physical Scientists 0.2059
## Job051800 TO 1860: Social Scientists and Related Workers 0.5529
## Job051900 TO 1960: Life, Physical and Social Science Technicians 0.5691
## Job052000 TO 2060: Counselors, Sociala and Religious Workers 0.5338
## Job052100 TO 2150: Lawyers, Judges and Legal Support Workers 0.4792
## Job052200 TO 2340: Teachers 0.3392
## Job052400 TO 2550: Education, Training and Library Workers 0.4768
## Job052600 TO 2760: Entertainers and Performers, Sports and Related Workers 0.0097
## Job052800 TO 2960: Media and Communications Workers 0.3228
## Job053000 TO 3260: Health Diagnosing and Treating Practitioners 0.0323
## Job053300 TO 3650: Health Care Technical and Support Occupations 0.5759
## Job053700 TO 3950: Protective Service Occupations 0.0140
## Job054000 TO 4160: Food Preparation and Serving Related Occupations 0.5127
## Job054200 TO 4250: Cleaning and Building Service Occupations 0.2379
## Job054300 TO 4430: Entertainment Attendants and Related Workers 0.0926
## Job054500 TO 4650: Personal Care and Service Workers 0.2699
## Job054700 TO 4960: Sales and Related Workers 0.1190
## Job05500 TO 950: Management Related Occupations 0.0082
## Job055000 TO 5930: Office and Administrative Support Workers 0.0866
## Job056000 TO 6130: Farming, Fishing and Forestry Occupations 0.6864
## Job056200 TO 6940: Construction Trade and Extraction Workers 0.8830
## Job057000 TO 7620: Installation, Maintenance and Repairs Workers 0.4816
## Job057700 TO 7750: Production and Operating Workers 0.4325
## Job057800 TO 7850: Food Preparation Occupations 0.7612
## Job057900 TO 8960: Setters, Operators and Tenders 0.4839
## Job059000 TO 9750: Transportation and Material Moving Workers 0.5667
## Job059990: Uncodeable 0.9664
## MotherEd 0.0242
## FatherEd 0.2660
## FamilyIncome78 0.1290
## bmi 0.3106
## log_income 0.0058
##
## (Intercept) ***
## Gendermale **
## Education05 ***
## Job0510 TO 430: Executive, Administrative and Managerial Occupations **
## Job051000 TO 1240: Mathematical and Computer Scientists *
## Job051300 TO 1560: Engineers, Architects, Surveyers, Engineering and Related Technicians
## Job051600 TO 1760: Physical Scientists
## Job051800 TO 1860: Social Scientists and Related Workers
## Job051900 TO 1960: Life, Physical and Social Science Technicians
## Job052000 TO 2060: Counselors, Sociala and Religious Workers
## Job052100 TO 2150: Lawyers, Judges and Legal Support Workers
## Job052200 TO 2340: Teachers
## Job052400 TO 2550: Education, Training and Library Workers
## Job052600 TO 2760: Entertainers and Performers, Sports and Related Workers **
## Job052800 TO 2960: Media and Communications Workers
## Job053000 TO 3260: Health Diagnosing and Treating Practitioners *
## Job053300 TO 3650: Health Care Technical and Support Occupations
## Job053700 TO 3950: Protective Service Occupations *
## Job054000 TO 4160: Food Preparation and Serving Related Occupations
## Job054200 TO 4250: Cleaning and Building Service Occupations
## Job054300 TO 4430: Entertainment Attendants and Related Workers .
## Job054500 TO 4650: Personal Care and Service Workers
## Job054700 TO 4960: Sales and Related Workers
## Job05500 TO 950: Management Related Occupations **
## Job055000 TO 5930: Office and Administrative Support Workers .
## Job056000 TO 6130: Farming, Fishing and Forestry Occupations
## Job056200 TO 6940: Construction Trade and Extraction Workers
## Job057000 TO 7620: Installation, Maintenance and Repairs Workers
## Job057700 TO 7750: Production and Operating Workers
## Job057800 TO 7850: Food Preparation Occupations
## Job057900 TO 8960: Setters, Operators and Tenders
## Job059000 TO 9750: Transportation and Material Moving Workers
## Job059990: Uncodeable
## MotherEd *
## FatherEd
## FamilyIncome78
## bmi
## log_income **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.2 on 2375 degrees of freedom
## Multiple R-squared: 0.118, Adjusted R-squared: 0.104
## F-statistic: 8.58 on 37 and 2375 DF, p-value: <2e-16
model2 <- lm(pc1_esteem ~ Gender + Education05 + MotherEd + log_income, rgr.data)
summary(model2)
##
## Call:
## lm(formula = pc1_esteem ~ Gender + Education05 + MotherEd + log_income,
## data = rgr.data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.530 -0.991 -0.005 1.013 2.605
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.40594 0.16194 51.91 < 2e-16 ***
## Gendermale 0.12546 0.05011 2.50 0.01235 *
## Education05 0.10840 0.01094 9.91 < 2e-16 ***
## MotherEd 0.04592 0.01049 4.38 1.3e-05 ***
## log_income 0.03169 0.00856 3.70 0.00022 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.22 on 2408 degrees of freedom
## Multiple R-squared: 0.0842, Adjusted R-squared: 0.0827
## F-statistic: 55.4 on 4 and 2408 DF, p-value: <2e-16
The Cancer Genome Atlas (TCGA), a landmark cancer genomics program by the National Cancer Institute (NCI), molecularly characterized over 20,000 primary cancer and matched normal samples spanning 33 cancer types. The genomic data are publicly available from the Genomic Data Commons (GDC) Data Portal.
In this study, we focus on 4 sub-types of breast cancer (BRCA): basal-like (basal), Luminal A-like (lumA), Luminal B-like (lumB), HER2-enriched. The sub-type is based on PAM50, a clinical-grade luminal-basal classifier.
We will try to use the mRNA expression data alone, without the labels, to classify the 4 sub-types. Classification without labels, or prediction without outcomes, is called unsupervised learning. We will use K-means and spectral clustering on the mRNA data and see whether the sub-types can be separated.
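For reference, with K = 4, K-means seeks the partition of the samples that minimizes the total within-cluster sum of squares:

\[
\min_{C_1,\dots,C_K} \sum_{k=1}^{K} \sum_{i \in C_k} \lVert x_i - \mu_k \rVert_2^2,
\qquad
\mu_k = \frac{1}{|C_k|} \sum_{i \in C_k} x_i ,
\]

where \(\mu_k\) is the centroid (mean) of cluster \(C_k\).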
We first read the data using data.table::fread() which is a faster way to read in big data than read.csv().
Summary and transformation
brca <- fread("data/brca_subtype.csv")
# get the sub-type information
brca_subtype <- brca$BRCA_Subtype_PAM50
brca <- brca[,-1]
table(brca_subtype)
## brca_subtype
## Basal Her2 LumA LumB
## 208 91 628 233
b) Randomly pick 5 genes and plot the histogram by each sub-type.
set.seed(10)
brca_sample_idx <- sample(ncol(brca), 5)
basal_indices <- which(brca_subtype == "Basal")
her2_indices <- which(brca_subtype == "Her2")
luma_indices <- which(brca_subtype == "LumA")
lumb_indices <- which(brca_subtype == "LumB")
brca[basal_indices,] %>%
select(all_of(brca_sample_idx)) %>% # select column by index
pivot_longer(cols = everything()) %>% # long format for facet_wrap()
ggplot(aes(x = value, y = ..density..)) +
geom_histogram(aes(fill = name)) +
facet_wrap(~name, scales = "free") +
theme_bw() +
theme(legend.position = "none")
brca[her2_indices,] %>%
select(all_of(brca_sample_idx)) %>% # select column by index
pivot_longer(cols = everything()) %>% # long format for facet_wrap()
ggplot(aes(x = value, y = ..density..)) +
geom_histogram(aes(fill = name)) +
facet_wrap(~name, scales = "free") +
theme_bw() +
theme(legend.position = "none")
brca[luma_indices,] %>%
select(all_of(brca_sample_idx)) %>% # select column by index
pivot_longer(cols = everything()) %>% # long format for facet_wrap()
ggplot(aes(x = value, y = ..density..)) +
geom_histogram(aes(fill = name)) +
facet_wrap(~name, scales = "free") +
theme_bw() +
theme(legend.position = "none")
brca[lumb_indices,] %>%
select(all_of(brca_sample_idx)) %>% # select column by index
pivot_longer(cols = everything()) %>% # long format for facet_wrap()
ggplot(aes(x = value, y = ..density..)) +
geom_histogram(aes(fill = name)) +
facet_wrap(~name, scales = "free") +
theme_bw() +
theme(legend.position = "none")
c) Remove genes with zero counts and no variability, then apply a logarithmic transform.
require(caret)
dim(brca)
## [1] 1160 19947
# remove genes with 0 counts
sel_cols <- which(colSums(abs(brca)) != 0)
brca_sub <- brca[, sel_cols, with=F]
dim(brca_sub)
## [1] 1160 19669
# remove genes with no variability (SD=0)
# after removing 0 counts, there are no genes/cols with all same values
# brca_sub[,-nearZeroVar(brca_sub)]
dim(brca_sub)
## [1] 1160 19669
# log
brca_sub <- log2(as.matrix(brca_sub+1e-10))
Apply kmeans with 4 centers, then compare brca_subtype and the cluster labels.
system.time({brca_sub_kmeans <- kmeans(x = brca_sub, 4)})
##    user  system elapsed
## 13.195 0.268 13.585
# save the results as RDS
saveRDS(brca_sub_kmeans, "data/brca_kmeans.RDS")
# read in tcga_sub_kmeans
brca_sub_kmeans <- readRDS("data/brca_kmeans.RDS")
# discrepancy table
table(brca_subtype, brca_sub_kmeans$cluster)
##
## brca_subtype 1 2 3 4
## Basal 1 17 3 187
## Her2 39 9 26 17
## LumA 392 82 154 0
## LumB 98 22 111 2
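One rough way to quantify the agreement in this table (a sketch only, since K-means cluster numbers are arbitrary) is to map each cluster to its majority sub-type and count the matches:

```r
# sketch: fraction of samples matched under a majority-label mapping;
# cluster numbering is arbitrary, so this is only a crude agreement score
tab <- table(brca_subtype, brca_sub_kmeans$cluster)
sum(apply(tab, 2, max)) / sum(tab)
```

For the table above this gives (392 + 82 + 154 + 187) / 1160, roughly 0.70.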
Spectral clustering: to scale or not to scale?
Compute the leading singular vectors with the fast partial SVD irlba::irlba().
require("irlba")
# center and scale the data
brca_sub_scaled_centered <- scale(as.matrix(brca_sub), center = T, scale = T)
svd_ret <- irlba::irlba(brca_sub_scaled_centered, nv = 10)
names(svd_ret)
## [1] "d" "u" "v" "iter" "mprod"
# Approximate the PVE
svd_var <- svd_ret$d^2/(nrow(brca_sub_scaled_centered)-1)
pve_apx <- svd_var/ncol(brca)
plot(pve_apx, type="b", pch = 19, frame = FALSE)
We may look at the scree plot of the PVEs and apply the elbow rule: take the number of PCs at which there is a sharp drop in the scree plot. We could choose either 2 or 4 PCs here; to capture more cumulative explained variance, we choose 4 PCs.
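The approximation computed above follows from the SVD: for centered data with singular values \(d_k\), the variance along PC \(k\) is \(d_k^2/(n-1)\), and when the columns are standardized the total variance equals the number of genes \(p\), giving

\[
\mathrm{PVE}_k \approx \frac{d_k^2/(n-1)}{p}.
\]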
b) Plot PC1 vs PC2 of the centered and scaled data and PC1 vs PC2 of the centered but unscaled data side by side. Should we scale or not scale for clustering purposes? Why? (Hint: to put plots side by side, use `gridExtra::grid.arrange()` or `ggpubr::ggarrange()` or `egg::ggarrange()` for ggplots; use `fig.show="hold"` as the chunk option for base plots)
require("gridExtra")
require("grid")
# get pc score
pc_score <- brca_sub_scaled_centered %*% svd_ret$v[, 1:3]
# apply kmean
kmean_ret <- kmeans(x = pc_score, 4)
p1 <- data.table(x = pc_score[,1],
y = pc_score[,2],
col = as.factor(brca_subtype),
cl = as.factor(kmean_ret$cluster)) %>%
ggplot() +
geom_point(aes(x = x, y = y, col = col, shape = cl)) +
scale_color_manual(labels = c("Basal", "Her2", "LumA", "LumB"),
values = scales::hue_pal()(4)) +
scale_shape_manual(labels = c("A", "B", "C", "D"),
values = c(1, 2, 3, 4)) +
theme_bw() +
labs(color = "Cancer type", shape = "Cluster") +
xlab("PC1") +
ylab("PC2") +
ggtitle("Centered and Scaled")
brca_sub_centered <- scale(as.matrix(brca_sub), center = T, scale = F)
svd_ret <- irlba::irlba(brca_sub_centered, nv = 10)
# Approximate the PVE
svd_var <- svd_ret$d^2/(nrow(brca_sub_centered)-1)
pve_apx <- svd_var/ncol(brca) # plot also shows 4 PC by elbow method
pc_score <- brca_sub_centered %*% svd_ret$v[, 1:3]
# apply kmean
kmean_ret <- kmeans(x = pc_score, 4)
p2 <- data.table(x = pc_score[,1],
y = pc_score[,2],
col = as.factor(brca_subtype),
cl = as.factor(kmean_ret$cluster)) %>%
ggplot() +
geom_point(aes(x = x, y = y, col = col, shape = cl)) +
scale_color_manual(labels = c("Basal", "Her2", "LumA", "LumB"),
values = scales::hue_pal()(4)) +
scale_shape_manual(labels = c("A", "B", "C", "D"),
values = c(1, 2, 3, 4)) +
theme_bw() +
labs(color = "Cancer type", shape = "Cluster") +
xlab("PC1") +
ylab("PC2") +
ggtitle("Centered")
grid.arrange(p1, p2, nrow = 1)
In our case, scaling does not seem necessary, since both plots show similar cluster shapes. In general, however, clustering is scale-sensitive when features are measured in different units, because Euclidean distance gives more weight to variables with larger magnitudes. Since all of our features are of the same type (log2 gene counts), scaling is not essential here.
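As a toy illustration of that scale sensitivity (made-up units, not our data):

```r
# two hypothetical samples with features on very different scales
a <- c(height_cm = 170, income = 50000)
b <- c(height_cm = 150, income = 51000)
m <- rbind(a, b)
dist(m)         # dominated almost entirely by the income difference
dist(scale(m))  # after scaling, both features contribute comparably
```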
Spectral clustering: center but do not scale the data
k.values <- 1:10
# function to compute total within-cluster sum of square
wss <- function(df, k) {
kmeans(df, k, nstart = 10)$tot.withinss
}
# extract wss for 1:10 clusters
wss_values <- map_dbl(k.values, function(k) wss(pc_score, k)) # cluster the samples' PC scores
plot(k.values, wss_values,
type="b", pch = 19, frame = FALSE,
xlab="Number of clusters K",
ylab="Total within-clusters sum of squares")
The elbow rule agrees that a good number of clusters is 4.
b) Choose an optimal cluster number and apply kmeans. Compare the real sub-type and the clustering label as follows: Plot scatter plot of PC1 vs PC2. Use point color to indicate the true cancer type and point shape to indicate the clustering label. Plot the kmeans centroids with black dots. Summarize how good is clustering results compared to the real sub-type.
p2 <- data.table(x = pc_score[,1],
y = pc_score[,2],
col = as.factor(brca_subtype),
cl = as.factor(kmean_ret$cluster)) %>%
ggplot() +
geom_point(aes(x = x, y = y, col = col, shape = cl)) +
scale_color_manual(labels = c("Basal", "Her2", "LumA", "LumB"),
values = scales::hue_pal()(4)) +
scale_shape_manual(labels = c("A", "B", "C", "D"),
values = c(1, 2, 3, 4)) +
theme_bw() +
labs(color = "Cancer type", shape = "Cluster") +
xlab("PC1") +
ylab("PC2") +
geom_point(aes(x=kmean_ret$centers[1,1], y=kmean_ret$centers[1,2]), colour="black") +
geom_point(aes(x=kmean_ret$centers[2,1], y=kmean_ret$centers[2,2]), colour="black") +
geom_point(aes(x=kmean_ret$centers[3,1], y=kmean_ret$centers[3,2]), colour="black") +
geom_point(aes(x=kmean_ret$centers[4,1], y=kmean_ret$centers[4,2]), colour="black")
p2
The clustering recovers one sub-type (Basal) well, but the other three remain very hard to distinguish.
c) Compare the clustering result from applying kmeans to the original data and the clustering result from applying kmeans to 4 PCs. Does PCA help in kmeans clustering? What might be the reasons if PCA helps?
table(brca_subtype, kmean_ret$cluster)
##
## brca_subtype 1 2 3 4
## Basal 3 17 187 1
## Her2 27 9 26 29
## LumA 161 91 0 376
## LumB 108 22 2 101
Compared to applying kmeans to the original data, PCA helps a little, with somewhat better separation among the 4 clusters. PCA reduces the dimensionality of the data while retaining most of its variance, so kmeans is also much faster to run. In addition, the leading PCs tend to filter out low-variance noise directions; since kmeans is sensitive to noisy, high-dimensional inputs, this can improve performance.
d) Now we have a patient X with breast cancer of unknown sub-type, together with this patient's mRNA sequencing data. Project patient X onto the space of PC1 and PC2. (Hint: remember we removed some genes with no counts or no variability, took logs, and centered.) Plot this patient in the plot in iv) with a black dot. Calculate the Euclidean distance between this patient and each cluster centroid. Can you tell which sub-type this patient might have?
#unscaled data
pca_unscaled <- prcomp(brca_sub, center = T, scale. = F)
pca_unscaled$rotation<- pca_unscaled$rotation[, 1:20]
pca_unscaled$x <- pca_unscaled$x[, 1:20]
pve_unscaled <- summary(pca_unscaled)$importance[2, 1:10]
kmeans_unscaled <- kmeans(x = pca_unscaled$x[,1:4], 4)
df_unscaled<-(cbind.data.frame(PC1=pca_unscaled$x[,1],
PC2=pca_unscaled$x[,2],
brca_subtype,
cluster=as.factor(kmeans_unscaled$cluster)))
x_patient <- fread("data/brca_x_patient.csv")
dim(x_patient)
## [1] 1 19947
# remove genes with 0 counts
sel_cols <- which(colSums(abs(x_patient)) != 0)
x_patient_sub <- x_patient[, sel_cols, with=F]
dim(x_patient_sub)
## [1] 1 17193
# log
x_patient_sub <- log2(as.matrix(x_patient_sub+1e-10))
x_patient_sub <- scale(as.matrix(x_patient_sub),center=T,scale=F)
pc1_loadings<-as.data.frame(pca_unscaled$rotation[,1])
pc2_loadings<-as.data.frame(pca_unscaled$rotation[,2])
x_pc1<-sum(pc1_loadings*x_patient_sub)
x_pc2<-sum(pc2_loadings*x_patient_sub)
ggplot(df_unscaled, aes(x=PC1,y=PC2,col=brca_subtype,shape=cluster))+
geom_point()+
geom_point(x=x_pc1,y=x_pc2,size=10)
table(brca_subtype, kmeans_unscaled$cluster)
##
## brca_subtype 1 2 3 4
## Basal 187 1 3 17
## Her2 28 27 27 9
## LumA 0 376 161 91
## LumB 2 99 110 22
centroid1_dist<-sqrt((kmeans_unscaled$centers[1,1]-x_pc1)^2 + (kmeans_unscaled$centers[1,2]-x_pc2)^2)
centroid2_dist<-sqrt((kmeans_unscaled$centers[2,1]-x_pc1)^2 + (kmeans_unscaled$centers[2,2]-x_pc2)^2)
centroid3_dist<-sqrt((kmeans_unscaled$centers[3,1]-x_pc1)^2 + (kmeans_unscaled$centers[3,2]-x_pc2)^2)
centroid4_dist<-sqrt((kmeans_unscaled$centers[4,1]-x_pc1)^2 + (kmeans_unscaled$centers[4,2]-x_pc2)^2)
dists<-rbind(c("cluster1 dist","cluster2 dist","cluster3 dist","cluster4 dist"),
c(centroid1_dist,centroid2_dist,centroid3_dist,centroid4_dist))
dists
## [,1] [,2] [,3]
## [1,] "cluster1 dist" "cluster2 dist" "cluster3 dist"
## [2,] "304.376612038529" "89.4657826434346" "256.553961436401"
## [,4]
## [1,] "cluster4 dist"
## [2,] "290.787367635422"
The new patient is closest to cluster 2, which is most likely to be the LumA subtype.
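Two caveats about the projection above: the patient's own zero-count filter keeps 17,193 genes while the PCA was fit on 19,669, and calling scale() on a single-row matrix centers each column by its own value. A more careful projection, sketched under the assumption that the training objects brca, x_patient, and pca_unscaled are still in memory:

```r
# keep the SAME genes used to fit the PCA, then log-transform
train_cols <- which(colSums(abs(brca)) != 0)   # training gene filter (19669 genes)
x_aligned  <- log2(as.matrix(x_patient)[, train_cols, drop = FALSE] + 1e-10)
# center with the TRAINING column means stored by prcomp(), not the patient's own
x_centered <- sweep(x_aligned, 2, pca_unscaled$center)
x_proj     <- x_centered %*% pca_unscaled$rotation[, 1:2]  # PC1/PC2 coordinates
```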
This question utilizes the Auto dataset from ISLR. The original dataset contains 408 observations about cars. It is similar to the CARS dataset that we use in our lectures. To get the data, first install the package ISLR. The Auto dataset should be loaded automatically. We’ll use this dataset to practice the methods learned so far. The original data source is here: https://archive.ics.uci.edu/ml/datasets/auto+mpg
Get familiar with this dataset first. Tip: you can use the command ?ISLR::Auto to view a description of the dataset.
#load libraries
library(ISLR) # load first so that Auto is available
library(ggplot2)
library(GGally)
?ISLR::Auto
autoraw <- Auto[, c(1, 2, 3, 4, 5, 6)] # keep the six numeric columns
dim(Auto)
## [1] 392 9
str(Auto)
## 'data.frame': 392 obs. of 9 variables:
## $ mpg : num 18 15 18 16 17 15 14 14 14 15 ...
## $ cylinders : num 8 8 8 8 8 8 8 8 8 8 ...
## $ displacement: num 307 350 318 304 302 429 454 440 455 390 ...
## $ horsepower : num 130 165 150 150 140 198 220 215 225 190 ...
## $ weight : num 3504 3693 3436 3433 3449 ...
## $ acceleration: num 12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
## $ year : num 70 70 70 70 70 70 70 70 70 70 ...
## $ origin : num 1 1 1 1 1 1 1 1 1 1 ...
## $ name : Factor w/ 304 levels "amc ambassador brougham",..: 49 36 231 14 161 141 54 223 241 2 ...
head(Auto)
## mpg cylinders displacement horsepower weight acceleration year origin
## 1 18 8 307 130 3504 12.0 70 1
## 2 15 8 350 165 3693 11.5 70 1
## 3 18 8 318 150 3436 11.0 70 1
## 4 16 8 304 150 3433 12.0 70 1
## 5 17 8 302 140 3449 10.5 70 1
## 6 15 8 429 198 4341 10.0 70 1
## name
## 1 chevrolet chevelle malibu
## 2 buick skylark 320
## 3 plymouth satellite
## 4 amc rebel sst
## 5 ford torino
## 6 ford galaxie 500
summary(Auto)
## mpg cylinders displacement horsepower weight
## Min. : 9.0 Min. :3.00 Min. : 68 Min. : 46.0 Min. :1613
## 1st Qu.:17.0 1st Qu.:4.00 1st Qu.:105 1st Qu.: 75.0 1st Qu.:2225
## Median :22.8 Median :4.00 Median :151 Median : 93.5 Median :2804
## Mean :23.4 Mean :5.47 Mean :194 Mean :104.5 Mean :2978
## 3rd Qu.:29.0 3rd Qu.:8.00 3rd Qu.:276 3rd Qu.:126.0 3rd Qu.:3615
## Max. :46.6 Max. :8.00 Max. :455 Max. :230.0 Max. :5140
##
## acceleration year origin name
## Min. : 8.0 Min. :70 Min. :1.00 amc matador : 5
## 1st Qu.:13.8 1st Qu.:73 1st Qu.:1.00 ford pinto : 5
## Median :15.5 Median :76 Median :1.00 toyota corolla : 5
## Mean :15.5 Mean :76 Mean :1.58 amc gremlin : 4
## 3rd Qu.:17.0 3rd Qu.:79 3rd Qu.:2.00 amc hornet : 4
## Max. :24.8 Max. :82 Max. :3.00 chevrolet chevette: 4
## (Other) :365
Explore the data, with particular focus on pairwise plots and summary statistics. Briefly summarize your findings and any peculiarities in the data.
pairs(Auto)
Auto$originf <- as.factor(Auto$origin)
Auto$yearf <- as.factor(Auto$year)
Auto$cylf <- as.factor(Auto$cylinders)
#Auto["originf"][Auto["originf"] == 1] <- "American"
#Auto["originf"][Auto["originf"] == 2] <- "European"
#Auto["originf"][Auto["originf"] == 3] <- "Japanese"
Auto %>% ggplot() + aes(x = weight, y = mpg, col = originf) + geom_point() +
geom_smooth(method='lm', formula= y~x)
Auto %>% ggplot() + aes(x = year, y = mpg, col = originf) + geom_point() +
geom_smooth(method='lm', formula= y~x)
Auto %>% ggplot() + aes(x = horsepower, y = mpg, col = originf) + geom_point() +
geom_smooth(method='lm', formula= y~x)
Auto %>%
ggplot(aes(x=weight, y=mpg, group = originf)) +
geom_point()+
geom_smooth(method="lm", formula=y~x, se=F,color = "red")+
facet_wrap(~origin) +
theme_bw()
Auto %>%
ggplot(aes(x=weight, y=mpg, group = year, col = originf)) +
geom_point()+
geom_smooth(method="lm", formula=y~x, se=F,color = "black")+
facet_wrap(~year) +
theme_bw()
Auto %>%
ggplot(aes(x = yearf, y = mpg, fill = originf)) +
geom_boxplot()
Auto %>%
ggplot(aes(x = cylf, y = mpg, fill = originf)) +
geom_boxplot()
From the above plots we can see numerous apparent relationships; however, closer inspection shows that many of the predictors are correlated with one another. It appears mpg increases with year and decreases with weight, displacement, cylinders, and horsepower.
Differences between the three origins are present even when other influences are isolated. Therefore, when building models, we will keep informative variables and work to eliminate those that are highly correlated or uninformative.
What effect does time have on MPG?
Fit a model of mpg vs. year and report R's summary output. Is year a significant variable at the .05 level? State what effect year has on mpg, if any, according to this model.
fit1 <- lm(mpg ~ year, data = Auto)
ggplot(Auto, aes(x = year , y = mpg)) +
geom_point() +
geom_smooth(method="lm",se=F) +
geom_hline(aes(yintercept = mean(mpg)), color = "red")
summary(fit1)
##
## Call:
## lm(formula = mpg ~ year, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.021 -5.441 -0.441 4.974 18.209
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -70.0117 6.6452 -10.5 <2e-16 ***
## year 1.2300 0.0874 14.1 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.36 on 390 degrees of freedom
## Multiple R-squared: 0.337, Adjusted R-squared: 0.335
## F-statistic: 198 on 1 and 390 DF, p-value: <2e-16
From fit1, the year variable has a p-value near 0, so it is significant at the 0.05 level. This is consistent with the plot: the horizontal mean line shows no trend while the fitted line shows a clear upward trend.
The estimated coefficient for year is 1.23, so according to this fit each one-year increase is associated with an average increase of 1.23 mpg.
Add horsepower on top of the variable year to your linear model. Is year still a significant variable at the .05 level? Give a precise interpretation of the year effect found here.
fit2 <- lm(mpg ~ year + horsepower, data = Auto)
summary(fit2)
##
## Call:
## lm(formula = mpg ~ year + horsepower, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.077 -3.078 -0.431 2.588 15.315
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -12.73917 5.34903 -2.38 0.018 *
## year 0.65727 0.06626 9.92 <2e-16 ***
## horsepower -0.13165 0.00634 -20.76 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.39 on 389 degrees of freedom
## Multiple R-squared: 0.685, Adjusted R-squared: 0.684
## F-statistic: 424 on 2 and 389 DF, p-value: <2e-16
Year remains statistically significant at the 0.05 level, but its coefficient decreased by nearly half after the introduction of horsepower.
The interpretation of the year coefficient is now: holding horsepower constant, each one-year increase raises mpg by 0.66 on average.
The 95% CI for (i) is beta_year = 1.2300 +/- 2(0.0874), roughly (1.06, 1.40). The 95% CI for (ii) is beta_year = 0.6573 +/- 2(0.0663), roughly (0.52, 0.79).
The second interval is narrower because horsepower absorbs much of the variation that was previously left unexplained. Since we now predict from both horsepower and year rather than year alone, the residual variance is smaller, which tightens the confidence interval on the year coefficient. Without the additional information from horsepower, more unexplained variance remained, so the interval had to be wider.
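The rough estimate ± 2 SE intervals above can be checked with confint(), which uses the exact t quantile instead of 2:

```r
confint(fit1, "year", level = 0.95)  # interval for the simple model
confint(fit2, "year", level = 0.95)  # narrower once horsepower is added
```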
Fit lm(mpg ~ year * horsepower). Is the interaction effect significant at the .05 level? Explain the year effect (if any).
fit3 <- lm(mpg ~ year * horsepower, data = Auto)
summary(fit3)
##
## Call:
## lm(formula = mpg ~ year * horsepower, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.349 -2.451 -0.456 2.406 14.444
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.27e+02 1.21e+01 -10.45 <2e-16 ***
## year 2.19e+00 1.61e-01 13.59 <2e-16 ***
## horsepower 1.05e+00 1.15e-01 9.06 <2e-16 ***
## year:horsepower -1.60e-02 1.56e-03 -10.22 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.9 on 388 degrees of freedom
## Multiple R-squared: 0.752, Adjusted R-squared: 0.75
## F-statistic: 393 on 3 and 388 DF, p-value: <2e-16
The interaction term is significant at the 0.05 level. With the interaction included, the year effect is two-fold: holding horsepower constant, the main effect raises mpg by 2.19 per year, while the interaction subtracts 0.016 x horsepower per year, so the net benefit of a newer model year shrinks as horsepower grows.
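Reading the two pieces together, the net year effect is the partial derivative of the fitted mean:

\[
\frac{\partial\,\widehat{\mathrm{mpg}}}{\partial\,\mathrm{year}}
= 2.19 - 0.016 \times \mathrm{horsepower},
\]

so a car with 100 horsepower, for example, gains about \(2.19 - 1.6 = 0.59\) mpg per model year.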
Remember that the same variable can play different roles! Take a quick look at the variable cylinders, and try to use this variable in the following analyses wisely. We all agree that a larger number of cylinders will lower mpg. However, we can interpret cylinders as either a continuous (numeric) variable or a categorical variable.
Fit a model with cylinders as a continuous/numeric variable. Is cylinders significant at the 0.01 level? What effect does cylinders play in this model?
fit4 <- lm(mpg ~ cylinders, data = Auto)
summary(fit4)
##
## Call:
## lm(formula = mpg ~ cylinders, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.241 -3.183 -0.633 2.549 17.917
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 42.916 0.835 51.4 <2e-16 ***
## cylinders -3.558 0.146 -24.4 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.91 on 390 degrees of freedom
## Multiple R-squared: 0.605, Adjusted R-squared: 0.604
## F-statistic: 597 on 1 and 390 DF, p-value: <2e-16
Treating cylinders as a continuous variable results in cylinders being significant at the 0.01 level. For each incremental cylinder, mpg decreases by 3.558, with a theoretical 0-cylinder car having 42.916 mpg. This is not easily interpreted, but does show a negative relationship.
Fit a model with cylinders as a categorical/factor variable. Is cylinders significant at the .01 level? What is the effect of cylinders in this model? Describe the cylinders effect over mpg.
fit5 <- lm(mpg ~ cylf, data = Auto)
summary(fit5)
##
## Call:
## lm(formula = mpg ~ cylf, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.284 -2.904 -0.963 2.344 18.027
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 20.550 2.349 8.75 < 2e-16 ***
## cylf4 8.734 2.373 3.68 0.00027 ***
## cylf5 6.817 3.589 1.90 0.05825 .
## cylf6 -0.577 2.405 -0.24 0.81071
## cylf8 -5.587 2.395 -2.33 0.02015 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.7 on 387 degrees of freedom
## Multiple R-squared: 0.641, Adjusted R-squared: 0.638
## F-statistic: 173 on 4 and 387 DF, p-value: <2e-16
Now only cylf4 is significant at the 0.01 level; the other categories do not reach that threshold. To interpret the model: the intercept is the average mpg for a three-cylinder car, and each category's coefficient is its average difference in mpg from that three-cylinder baseline.
What are the differences between treating cylinders as a continuous versus a categorical variable in your models?
When treating cylinders as a continuous variable, the model estimates a single coefficient for each incremental cylinder, meaning that the model fits a straight line.
There are two problems with this. First, there are no 1-, 2-, or 7-cylinder cars, so cylinders is not really a continuous variable. The second, more important issue is that the increment from 3 to 4 cylinders and the increment from 7 to 8 cylinders likely have different effects on mpg, which a single slope cannot capture.
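The categorical fit sidesteps both problems because R expands the factor into dummy (indicator) columns, one per non-baseline level; this can be inspected with model.matrix():

```r
# design matrix for the factor model: an intercept (the 3-cylinder baseline)
# plus 0/1 indicators for the 4-, 5-, 6- and 8-cylinder levels
head(model.matrix(~ cylf, data = Auto))
```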
Test whether mpg is linear in cylinders (fit4) vs. mpg relates to cylinders as a categorical variable (fit5) at the .01 level.
anova(fit4, fit5)
## Analysis of Variance Table
##
## Model 1: mpg ~ cylinders
## Model 2: mpg ~ cylf
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 390 9416
## 2 387 8544 3 871 13.2 3.4e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Since the p-value of the ANOVA test is essentially 0 (p < 0.01), we reject the null hypothesis that mpg is linear in cylinders in favor of the model that treats cylinders as a categorical variable.
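The F statistic in the ANOVA table is the usual nested-model comparison, which can be reproduced from the printed sums of squares:

\[
F = \frac{(\mathrm{RSS}_1 - \mathrm{RSS}_2)/3}{\mathrm{RSS}_2/387}
= \frac{871/3}{8544/387}
\approx \frac{290.3}{22.1}
\approx 13.2 .
\]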
Final modeling question: we want to explore the effects of each feature as best as possible. You may explore interactions, feature transformations, higher order terms, or other strategies within reason. The model(s) should be as parsimonious (simple) as possible unless the gain in accuracy is significant from your point of view.
res <- cor(autoraw)
round(res, 2)
## mpg cylinders displacement horsepower weight acceleration
## mpg 1.00 -0.78 -0.81 -0.78 -0.83 0.42
## cylinders -0.78 1.00 0.95 0.84 0.90 -0.50
## displacement -0.81 0.95 1.00 0.90 0.93 -0.54
## horsepower -0.78 0.84 0.90 1.00 0.86 -0.69
## weight -0.83 0.90 0.93 0.86 1.00 -0.42
## acceleration 0.42 -0.50 -0.54 -0.69 -0.42 1.00
knitr::include_graphics("occam.png")
From the correlation matrix, we see that displacement and cylinders are strongly correlated (0.95), so the two probably don't provide much information on top of each other. Horsepower is also highly correlated with displacement (0.90), as is weight (0.93).
Building a model using the philosophy of “Occam’s razor,” “plurality should not be posited without necessity,” we will compare a model using all inputs and one that removes both insignificant ones and variables that are correlated with others.
Furthermore, to preserve interpretability and avoid possible over-fitting, we will avoid higher-order terms and interaction terms.
model1 <- lm(mpg ~ cylf + displacement + horsepower + weight +
acceleration + year + originf, data = Auto)
summary(model1) ### key output
##
## Call:
## lm(formula = mpg ~ cylf + displacement + horsepower + weight +
## acceleration + year + originf, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.680 -1.937 -0.068 1.671 12.776
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.21e+01 4.54e+00 -4.86 1.7e-06 ***
## cylf4 6.72e+00 1.65e+00 4.06 5.9e-05 ***
## cylf5 7.08e+00 2.52e+00 2.81 0.0052 **
## cylf6 3.35e+00 1.82e+00 1.84 0.0670 .
## cylf8 5.10e+00 2.11e+00 2.42 0.0161 *
## displacement 1.87e-02 7.22e-03 2.59 0.0100 **
## horsepower -3.49e-02 1.32e-02 -2.64 0.0087 **
## weight -5.78e-03 6.31e-04 -9.15 < 2e-16 ***
## acceleration 2.60e-02 9.30e-02 0.28 0.7802
## year 7.37e-01 4.89e-02 15.06 < 2e-16 ***
## originf2 1.76e+00 5.51e-01 3.20 0.0015 **
## originf3 2.62e+00 5.27e-01 4.96 1.0e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.1 on 380 degrees of freedom
## Multiple R-squared: 0.847, Adjusted R-squared: 0.842
## F-statistic: 191 on 11 and 380 DF, p-value: <2e-16
model2 <- lm(mpg ~ horsepower + weight +
acceleration + year + originf, data = Auto)
summary(model2) ### key output
##
## Call:
## lm(formula = mpg ~ horsepower + weight + acceleration + year +
## originf, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.492 -2.090 -0.008 1.832 13.450
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.80e+01 4.63e+00 -3.89 0.00012 ***
## horsepower -4.75e-03 1.32e-02 -0.36 0.71869
## weight -5.66e-03 5.03e-04 -11.25 < 2e-16 ***
## acceleration 3.92e-02 9.77e-02 0.40 0.68857
## year 7.55e-01 5.17e-02 14.62 < 2e-16 ***
## originf2 1.94e+00 5.21e-01 3.72 0.00023 ***
## originf3 2.27e+00 5.26e-01 4.31 2e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.34 on 385 degrees of freedom
## Multiple R-squared: 0.819, Adjusted R-squared: 0.817
## F-statistic: 291 on 6 and 385 DF, p-value: <2e-16
model3 <- lm(mpg ~ weight + year + originf, data = Auto)
summary(model3) ### key output
## Call:
## lm(formula = mpg ~ weight + year + originf, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.603 -2.113 -0.021 1.762 13.526
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -18.30694 4.01724 -4.56 7.0e-06 ***
## weight -0.00589 0.00026 -22.65 < 2e-16 ***
## year 0.76985 0.04867 15.82 < 2e-16 ***
## originf2 1.97631 0.51797 3.82 0.00016 ***
## originf3 2.21453 0.51882 4.27 2.5e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.34 on 387 degrees of freedom
## Multiple R-squared: 0.819, Adjusted R-squared: 0.817
## F-statistic: 438 on 4 and 387 DF, p-value: <2e-16
model4 <- lm(mpg ~ weight * year + originf, data = Auto)
AIC(model1, model2, model3, model4)
##        df  AIC
## model1 13 2013
## model2 8 2067
## model3 6 2064
## model4 7 2010
BIC(model1, model2, model3, model4)
##        df  BIC
## model1 13 2064
## model2 8 2099
## model3 6 2088
## model4 7 2038
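As a sanity check, AIC can be reproduced from first principles for a Gaussian linear model. The sketch below uses simulated data rather than the Auto fit, and all variable names are illustrative; note that R's AIC() is based on the full log-likelihood and counts the error variance as an estimated parameter.

```r
# AIC for a Gaussian linear model: -2*logLik + 2*k, where
# logLik = -n/2 * (log(2*pi) + 1 + log(RSS/n)) and k counts the
# regression coefficients plus the error variance sigma^2.
set.seed(2)
x <- rnorm(100)
y <- 1 + x + rnorm(100)
fit <- lm(y ~ x)
n <- length(y)
k <- length(coef(fit)) + 1          # +1 for sigma^2
rss <- sum(resid(fit)^2)
manual_aic <- n * log(2 * pi * rss / n) + n + 2 * k
all.equal(manual_aic, AIC(fit))     # TRUE
```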
For parsimony, we select model3, with just three variables: weight, year, and origin. From further inspection of the residual plots, there appears to be limited gain from the added complexity of model1 relative to model3.
We also fit model4, which adds a weight * year interaction, the idea being that fuel efficiency could be non-linear in more advanced cars, especially heavier ones. Judging from the AIC and BIC values above and the plots below, however, this model offers only a modest improvement that we judge not worth the added complexity.
res3 <- resid(model3)
plot(fitted(model3), res3)
abline(0, 0)

res1 <- resid(model1)
plot(fitted(model1), res1)
abline(0, 0)

res4 <- resid(model4)
plot(fitted(model4), res4)
abline(0, 0)

qqnorm(res3)
qqline(res3)

qqnorm(res4)
qqline(res4)
From the above plots, the residuals appear non-normal in the upper tail: the upper tail of the Q-Q plot is heavy. The residual plot for model3 also shows greater variance at higher fitted values.
There appears to be rough homoskedasticity over the middle of the range of fitted values, with heteroskedasticity at the outlying observations in both tails. The residual plots for model4, with the interaction term, show a similar pattern.
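The visual impression of non-constant variance can be checked numerically by regressing the squared residuals on the fitted values, a simplified Breusch-Pagan-style diagnostic. This is a sketch on simulated data with a built-in variance trend, not the Auto fit; all names are illustrative.

```r
# Simplified Breusch-Pagan-style check for non-constant variance:
# regress squared residuals on fitted values; a significant slope
# suggests heteroskedasticity. Simulated data used for illustration.
set.seed(42)
x <- runif(200)
y <- 1 + 2 * x + rnorm(200, sd = 0.5 + x)   # error sd grows with x
fit <- lm(y ~ x)
aux <- lm(resid(fit)^2 ~ fitted(fit))       # auxiliary regression
p_value <- summary(aux)$coefficients[2, 4]  # small value flags heteroskedasticity
p_value
```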
With the selected model, model3, the estimated effects are as follows:
- mpg increases with respect to year
- mpg decreases with additional weight
- Cars made in Europe get a slight boost relative to American cars
- Cars made in Japan get a slight boost relative to American cars
Predict the mpg of the following car: a red car built in the US in 1983 that is 180 inches long, has eight cylinders, displaces 350 cu. inches, weighs 4000 pounds, and has a horsepower of 260. Also give a 95% CI for your prediction.
newcar <- Auto[1, ] # Create a new row with the same structure as the data
newcar[1] <- NA # mpg is unknown; assign the features of the new car below
newcar$origin <- 1
newcar$cylinders <- 8
newcar$displacement <- 350
newcar$horsepower <- 260
newcar$weight <- 4000
newcar$year <- 83
predict(model3, newcar, interval = "prediction", se.fit = TRUE)
## $fit
## fit lwr upr
## 1 22 15.4 28.7
##
## $se.fit
## [1] 0.484
##
## $df
## [1] 387
##
## $residual.scale
## [1] 3.34
predict(model3, newcar, interval = "confidence", se.fit = TRUE)
## $fit
## fit lwr upr
## 1 22 21.1 23
##
## $se.fit
## [1] 0.484
##
## $df
## [1] 387
##
## $residual.scale
## [1] 3.34
From the above results, the predicted fuel efficiency is 22 mpg, with a 95% prediction interval between 15.4 and 28.7 mpg.
The 95% confidence interval for the mean response is between 21.1 and 23 mpg.
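The gap between the two intervals comes from the extra residual variance in the prediction interval: the CI uses se.fit alone, while the PI uses sqrt(se.fit^2 + sigma^2). A self-contained sketch on simulated data (names illustrative, not the Auto fit):

```r
# Reconstruct confidence and prediction intervals from predict() pieces.
set.seed(1)
x <- runif(50)
y <- 3 + 2 * x + rnorm(50)
fit <- lm(y ~ x)
new <- data.frame(x = 0.5)
pr <- predict(fit, new, se.fit = TRUE)
s <- pr$residual.scale               # estimate of sigma
t_star <- qt(0.975, pr$df)
conf_int <- pr$fit + c(-1, 1) * t_star * pr$se.fit                # mean response
pred_int <- pr$fit + c(-1, 1) * t_star * sqrt(pr$se.fit^2 + s^2)  # new observation
conf_int
pred_int
```

These reproduce predict(fit, new, interval = "confidence") and interval = "prediction" exactly.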
This exercise is designed to help you understand the linear model using simulations. In this exercise, we will generate \((x_i, y_i)\) pairs so that all linear model assumptions are met.
Presume that \(\mathbf{x}\) and \(\mathbf{y}\) are linearly related with a normal error \(\boldsymbol{\varepsilon}\) , such that \(\mathbf{y} = 1 + 1.2\mathbf{x} + \boldsymbol{\varepsilon}\). The standard deviation of the error \(\varepsilon_i\) is \(\sigma = 2\).
We can create a sample input vector (\(n = 40\)) for \(\mathbf{x}\) with the following code:
# Generates a vector of size 40 with equally spaced values between 0 and 1, inclusive
x <- seq(0, 1, length = 40)
Create a corresponding output vector for \(\mathbf{y}\) according to the equation given above. Use set.seed(1). Then, create a scatterplot of the \((x_i, y_i)\) pairs. Base R plotting is acceptable, but if you can, please attempt to use ggplot2 to create the plot. Make sure to have clear labels and sensible titles on your plots.
Find the LS estimates of \(\boldsymbol{\beta}_0\) and \(\boldsymbol{\beta}_1\) using the lm() function. What are the true values of \(\boldsymbol{\beta}_0\) and \(\boldsymbol{\beta}_1\)? Do the estimates look reasonable?
What is your RSE for this linear model fit? Is it close to \(\sigma = 2\)?
What is the 95% confidence interval for \(\boldsymbol{\beta}_1\)? Does this confidence interval capture the true \(\boldsymbol{\beta}_1\)?
Overlay the LS estimates and the true lines of the mean function onto a copy of the scatterplot you made above.
Provide a residual plot with the fitted \(\mathbf{y}\)-values on the x-axis and the residuals on the y-axis.
Provide a normal QQ plot of the residuals.
Comment on how well the model assumptions are met for the sample you used.
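The steps above can be sketched in base R as follows; this is one possible outline under the stated model, not a definitive solution (ggplot2 would work equally well for the plots):

```r
# Simulate y = 1 + 1.2*x + eps, eps ~ N(0, 2^2), and fit by least squares.
set.seed(1)
x <- seq(0, 1, length = 40)
y <- 1 + 1.2 * x + rnorm(40, sd = 2)   # true beta_0 = 1, beta_1 = 1.2
fit <- lm(y ~ x)
coef(fit)                  # LS estimates of beta_0 and beta_1
summary(fit)$sigma         # RSE, to compare with sigma = 2
ci <- confint(fit)["x", ]  # 95% CI for beta_1
ci
plot(x, y, main = "Simulated data: LS fit vs. true line",
     xlab = "x", ylab = "y")
abline(fit, col = "blue")      # LS estimate
abline(1, 1.2, col = "red")    # true mean function
plot(fitted(fit), resid(fit), xlab = "fitted values", ylab = "residuals")
abline(0, 0)
qqnorm(resid(fit))
qqline(resid(fit))
```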
This part aims to help you understand the notion of sampling statistics and confidence intervals. Let’s concentrate on estimating the slope only.
Generate 100 samples of size \(n = 40\), and estimate the slope coefficient from each sample. We include some sample code below, which should guide you in setting up the simulation. Note: this code is easier to follow but suboptimal; see the appendix for a more optimal R-like way to run this simulation.
# Initializing variables. Note se, b1, upper_ci, lower_ci will grow into vectors
x <- seq(0, 1, length = 40)
n_sim <- 100 # number of simulations
b1 <- 0 # n_sim many LS estimates of beta_1 (=1.2). Initialize to 0 for now
se <- 0 # standard errors of the slope estimates. Initialize to 0 for now
upper_ci <- 0 # upper bound for beta_1. Initialize to 0 for now
lower_ci <- 0 # lower bound for beta_1. Initialize to 0 for now
t_star <- qt(0.975, 38) # Food for thought: why 38 instead of 40? What is t_star?
# Perform the simulation
for (i in 1:n_sim){
  y <- 1 + 1.2 * x + rnorm(40, sd = 2)
  lse <- lm(y ~ x)
  lse_output <- summary(lse)$coefficients
  se[i] <- lse_output[2, 2]
  b1[i] <- lse_output[2, 1]
  upper_ci[i] <- b1[i] + t_star * se[i]
  lower_ci[i] <- b1[i] - t_star * se[i]
}
results <- data.frame(se, b1, upper_ci, lower_ci)
# Remove unnecessary variables from our workspace
rm(se, b1, upper_ci, lower_ci, x, n_sim, t_star, lse, lse_output)
Summarize the LS estimates of \(\boldsymbol{\beta}_1\) (stored in results$b1). Does the sampling distribution agree with theory?
How many of your 95% confidence intervals capture the true \(\boldsymbol{\beta}_1\)? Display your confidence intervals graphically.
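Coverage can be counted and the intervals displayed with segments(); the sketch below re-runs the simulation so it is self-contained rather than depending on the workspace above (names illustrative):

```r
# Re-run the simulation, then count how many 95% CIs cover beta_1 = 1.2.
set.seed(1)
x <- seq(0, 1, length = 40)
n_sim <- 100
t_star <- qt(0.975, 38)
b1 <- upper_ci <- lower_ci <- numeric(n_sim)
for (i in 1:n_sim) {
  y <- 1 + 1.2 * x + rnorm(40, sd = 2)
  est <- summary(lm(y ~ x))$coefficients
  b1[i] <- est[2, 1]
  upper_ci[i] <- b1[i] + t_star * est[2, 2]
  lower_ci[i] <- b1[i] - t_star * est[2, 2]
}
covered <- lower_ci <= 1.2 & 1.2 <= upper_ci
sum(covered)   # number of intervals capturing the true slope; typically near 95
# Display the intervals, colouring the misses red
plot(1:n_sim, b1, ylim = range(lower_ci, upper_ci),
     xlab = "simulation", ylab = "slope estimate")
segments(1:n_sim, lower_ci, 1:n_sim, upper_ci,
         col = ifelse(covered, "grey", "red"))
abline(h = 1.2, lty = 2)
```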